Nazmus Ashrafi

LEOPARD: Precision-Centric Grammatical Error Correction

LEOPARD is a multi-agent LLM framework for Grammatical Error Correction (GEC) — the task behind automated grammar tutors for language learners. It was accepted at the IEEE International Conference on Frontiers of Engineering and Emerging Technologies (FET 2026, Bahrain).

The problem: the "Fluency Paradox"

When you ask a large language model to fix a learner's sentence, it tends to do too much. Instead of correcting the specific grammatical error, it rewrites valid-but-non-native phrasing into polished, professional English. This is the Fluency Paradox: the model prioritizes native-like fluency over a faithful diagnosis of the student's actual mistake.

For a grammar tutor this is counterproductive: it erases the learner's original voice, conflates style preference with real error, and gives feedback on the wrong thing. In standard GEC metrics this shows up as a precision failure: the model "fixes" things that were never broken, generating false positives. For example, changing "I arrived to the station" → "I reached the station" is fluent, but the learner only needed the preposition fixed: "I arrived at the station".

The idea: decouple generation from verification

LEOPARD splits correction into two specialized agents rather than relying on one model to both write and police itself:

The Boss Agent (Generator) — a high-capacity LLM (GPT-4.1) prompted as a "senior linguistic expert" that proposes fluent corrections with a focus on naturalness.
The Purist Agent (Veto) — a precision-focused gatekeeper that decomposes the Boss's output into discrete edits and systematically vetoes any edit that is stylistic rather than grammatically necessary.

Between them, ERRANT aligns the original and corrected sentences and labels each change with an error type (e.g. R:PREP, M:DET, R:NOUN), so the system reasons about a set of independent edit proposals instead of one monolithic rewrite.
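
To make "a set of independent edit proposals" concrete, here is a minimal Python sketch. It uses the standard-library difflib as a stand-in for ERRANT, so it only recovers token spans and attaches no error-type labels; the Edit class and the extract_edits function are hypothetical names for illustration, not part of LEOPARD or ERRANT.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Edit:
    """One proposed change, expressed as a token span in the original sentence."""
    start: int        # index of the first original token affected
    end: int          # index one past the last original token affected
    original: str     # original tokens joined by spaces ("" for an insertion)
    correction: str   # replacement tokens joined by spaces ("" for a deletion)


def extract_edits(original: str, corrected: str) -> list[Edit]:
    """Split a rewrite into discrete edits using a plain token-level diff.

    Stand-in only: ERRANT performs linguistically informed alignment and
    attaches error types such as R:PREP; difflib does neither.
    """
    src, tgt = original.split(), corrected.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged tokens are not edit proposals
        edits.append(Edit(i1, i2, " ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
    return edits


edits = extract_edits(
    "I arrived to the station and buyed ticket",
    "I arrived at the station and bought a ticket",
)
for e in edits:
    print(e)
```

On this pair the plain diff yields two proposals: "to" → "at" and "buyed" → "bought a". The second one merges a verb fix and an inserted determiner into a single span, which illustrates why a purpose-built aligner with typed edits is a better fit for per-edit vetoing than a generic diff.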

How the Purist vetoes edits (3-stage cascade)

Each edit runs through an escalating filter:

Stage 1: Fast Pass (deterministic). Purely mechanical edits (punctuation, spelling/orthography) are auto-approved, which avoids spending LLM calls on them.
Stage 2: Heuristic Risk/Overkill filter. Catches hallucinations and unnecessary rewrites using an expansion check, a Levenshtein edit-distance check (a string-level proxy for how far an edit drifts from the original), and a closed-class whitelist (prepositions, determiners, pronouns) so essential function-word fixes aren't wrongly rejected.
Stage 3: LLM Judge. Remaining edits go to a cheaper LLM (GPT-4o-mini) with a strict rubric: "if the original is already grammatically correct, you MUST veto." This biases the system strongly toward precision.
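
The cascade can be sketched in Python as a single decision function. This is an illustration under stated assumptions, not LEOPARD's implementation: the two thresholds, the order of the Stage 2 checks, the choice to send whitelisted edits on to the judge, and the names purist_decision and llm_judge are all hypothetical. The judge is passed in as a plain callable so the sketch runs without an API key.

```python
from typing import Callable

MECHANICAL = {"PUNCT", "SPELL", "ORTH"}   # Stage 1: auto-approved categories
CLOSED_CLASS = {"PREP", "DET", "PRON"}    # Stage 2: exempt from heuristic veto
MAX_EXTRA_TOKENS = 2                      # hypothetical expansion limit
MAX_DISTANCE_RATIO = 0.8                  # hypothetical edit-distance limit


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def category(error_type: str) -> str:
    """Extract the category from an ERRANT-style tag, e.g. 'R:PREP' -> 'PREP'."""
    parts = error_type.split(":")
    return parts[1] if len(parts) > 1 else parts[0]


def purist_decision(original: str, correction: str, error_type: str,
                    llm_judge: Callable[[str, str], bool]) -> str:
    """Return 'allow' or 'veto' for one edit."""
    cat = category(error_type)

    # Stage 1: mechanical edits pass without further checks.
    if cat in MECHANICAL:
        return "allow"

    # Stage 2: heuristic filter, skipped for closed-class function words.
    if cat not in CLOSED_CLASS:
        extra_tokens = len(correction.split()) - len(original.split())
        if extra_tokens > MAX_EXTRA_TOKENS:
            return "veto"   # expansion check: the edit adds too much text
        longest = max(len(original), len(correction), 1)
        if levenshtein(original, correction) / longest > MAX_DISTANCE_RATIO:
            return "veto"   # distance check: the edit drifts too far

    # Stage 3: defer to the judge (a stub here, an LLM call in practice).
    return "allow" if llm_judge(original, correction) else "veto"


def always_allow(original: str, correction: str) -> bool:
    """Stub judge standing in for the LLM call."""
    return True


print(purist_decision(",", ";", "R:PUNCT", always_allow))
print(purist_decision("to", "at", "R:PREP", always_allow))
print(purist_decision("big", "vast", "R:ADJ", always_allow))
```

With the permissive stub judge, the punctuation edit is allowed at Stage 1, the preposition edit bypasses Stage 2 and is allowed by the judge, and "big" → "vast" is vetoed at Stage 2 because the two words share no characters. Ordering the stages from cheapest to most expensive means the LLM is only consulted for edits the deterministic checks cannot settle.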

Only allowed edits are applied to the original sentence; vetoed edits revert to the learner's original wording, preserving their voice and sentence structure.
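
A minimal Python sketch of this final step, assuming edits are non-overlapping token spans over a whitespace-tokenized sentence. The function name apply_allowed_edits and the tuple layout (start, end, correction) are hypothetical choices for illustration.

```python
def apply_allowed_edits(original: str,
                        edits: list[tuple[int, int, str]],
                        allowed: list[bool]) -> str:
    """Apply only the approved edits; vetoed spans keep the learner's wording.

    Each edit is (start, end, correction) over the original token indices.
    Edits are assumed not to overlap.
    """
    tokens = original.split()
    paired = sorted(zip(edits, allowed), key=lambda pair: pair[0][0], reverse=True)
    # Work right to left so earlier spans' indices stay valid after a splice.
    for (start, end, correction), ok in paired:
        if ok:
            tokens[start:end] = correction.split()
    return " ".join(tokens)


sentence = "I arrived to the big station"
proposals = [(2, 3, "at"), (4, 5, "large")]   # preposition fix, stylistic swap
verdicts = [True, False]                      # allow the first, veto the second

print(apply_allowed_edits(sentence, proposals, verdicts))
```

Here the preposition fix is applied while the vetoed stylistic swap is dropped, so the result keeps the learner's own word "big".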

Results

On the CLC-FCE benchmark (Cambridge learner-English exam scripts), compared to a zero-shot GPT-4.1 baseline:

F0.5, the GEC-standard metric that weights precision twice as heavily as recall, improved by 14.6%.
Precision rose from 0.222 to 0.255, with false positives cut from 101 to 92: the Purist filters out valid-but-unnecessary edits without sacrificing recall, which also rose.
Per-error-type gains were strongest where structural verification helps most: +12.5% on missing determiners and +25% on noun-logic corrections.
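
For readers unfamiliar with the metric, the Python sketch below computes F-beta from edit-level counts and shows why removing false positives moves F0.5. The counts in the example are made up for illustration and are not taken from the paper's experiments.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta score from true positives, false positives and false negatives.

    With beta = 0.5, precision counts for more than recall.
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative counts only: the same true positives and misses,
# before and after a filter removes half of the false positives.
before = f_beta(tp=30, fp=10, fn=20)
after = f_beta(tp=30, fp=5, fn=20)

print(f"F0.5 before filtering: {before:.3f}")
print(f"F0.5 after filtering:  {after:.3f}")
```

Because recall is unchanged in this toy example, the whole gain comes from precision, which is the effect a veto stage is designed to produce.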

Why it matters

LEOPARD shows you don't need a specialized, expensively trained GEC model to improve precision: you can architecturally constrain a general-purpose LLM instead. Decoupling generation from a separate verification step (deterministic filters first, an LLM judge last) is the key to recovering the precision needed to deploy trustworthy, voice-preserving AI tutors. Orchestration, not retraining, does the work.

Status: Accepted for publication, IEEE FET 2026 (Bahrain) · Scopus-indexed
Code: github.com/nazmus-ashrafi/LEOPARD-GEC
Tech: Python, LangChain, LangGraph, OpenAI API, spaCy, ERRANT, asyncio