April 28 2026

LEOPARD: Precision-Centric Grammatical Error Correction

LEOPARD is a multi-agent LLM framework for Grammatical Error Correction (GEC) — the task behind automated grammar tutors for language learners. It was accepted at the IEEE International Conference on Frontiers of Engineering and Emerging Technologies (FET 2026, Bahrain).

The problem: the "Fluency Paradox"

When you ask a large language model to fix a learner's sentence, it tends to do too much. Instead of correcting the specific grammatical error, it rewrites valid-but-non-native phrasing into polished, professional English. This is the Fluency Paradox: the model prioritizes native-like fluency over a faithful diagnosis of the student's actual mistake.

For a grammar tutor this is counterproductive: it erases the learner's original voice, conflates style preference with real error, and gives feedback on the wrong thing. In standard GEC metrics this shows up as a precision failure, because the model "fixes" things that were never broken and so generates false positives. For example, changing "I arrived to the station" to "I reached the station" is fluent but isn't the grammatical fix the learner needed.
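To make the precision failure concrete, here is a minimal sketch of how edit-level precision, recall, and F0.5 (the precision-weighted score commonly reported in GEC) respond to a single unnecessary edit. The edit tuples, the gold annotation ("to" corrected to "at"), and the function name are illustrative assumptions, not data or code from the LEOPARD paper.

```python
def precision_recall_f05(proposed, gold, beta=0.5):
    """Score proposed edits against gold edits; beta < 1 weights precision more."""
    proposed, gold = set(proposed), set(gold)
    tp = len(proposed & gold)   # edits the annotator also made
    fp = len(proposed - gold)   # "fixes" to things that were never broken
    fn = len(gold - proposed)   # real errors the system missed
    # Convention assumed here: an empty denominator scores 1.0.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


# Hypothetical edits on "I arrived to the station", as
# (start_token, end_token, replacement).
gold = {(2, 3, "at")}                          # assumed gold fix: "to" -> "at"
minimal = {(2, 3, "at")}                       # only the necessary edit
over_edited = {(2, 3, "at"), (0, 1, "We")}     # plus one unnecessary change

print(precision_recall_f05(minimal, gold))      # (1.0, 1.0, 1.0)
print(precision_recall_f05(over_edited, gold))  # precision drops to 0.5
```

With one extra edit the recall is unchanged, but precision halves and F0.5 falls to roughly 0.56, which is why over-correction is so costly under this metric.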

The idea: decouple generation from verification

LEOPARD splits correction into two specialized agents rather than relying on one model to both write and police itself:

  1. The Boss Agent (Generator) — a high-capacity LLM (GPT-4.1) prompted as a "senior linguistic expert" that proposes fluent corrections with a focus on naturalness.
  2. The Purist Agent (Veto) — a precision-focused gatekeeper that decomposes the Boss's output into discrete edits and systematically vetoes any edit that is stylistic rather than grammatically necessary.

Between them, ERRANT aligns the original and corrected sentences and labels each change with an error type (e.g. R:PREP, M:DET, R:NOUN), so the system reasons about a set of independent edit proposals instead of one monolithic rewrite.
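The idea of turning one rewrite into independent edit proposals can be sketched with the standard library alone. The example below uses Python's difflib as a stand-in for ERRANT: it aligns whitespace tokens and emits one edit per changed span, but it does not assign error types such as R:PREP, and its alignment is not linguistically informed the way ERRANT's is. The function name and the example sentences are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def extract_edits(original, corrected):
    """Return token-level edits as (start, end, original_text, replacement)."""
    src, tgt = original.split(), corrected.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged tokens are not edits
        # start/end index the ORIGINAL tokens, so each edit can later be
        # accepted or rejected on its own.
        edits.append((i1, i2, " ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
    return edits


edits = extract_edits(
    "I arrived to the station yesterday",
    "I arrived at the train station yesterday",
)
print(edits)  # [(2, 3, 'to', 'at'), (4, 4, '', 'train')]
```

The rewrite decomposes into two proposals: a preposition replacement and an inserted noun. A verifier can then judge each one separately instead of accepting or rejecting the whole sentence.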

How the Purist vetoes edits (3-stage cascade)

Each edit runs through an escalating filter:

Only allowed edits are applied to the original sentence; vetoed edits revert to the learner's original wording, preserving their voice and sentence structure.
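A minimal sketch of this apply-or-revert step, assuming edits are stored as token spans over the original sentence: the Edit class, the apply_allowed_edits function, and the toy veto rule are hypothetical names introduced here, and the one-line predicate merely stands in for the Purist's actual cascade.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Edit:
    start: int        # index of the first original token replaced
    end: int          # one past the last original token replaced
    replacement: str  # corrected text ("" deletes the span)
    error_type: str   # ERRANT-style label, e.g. "R:PREP"


def apply_allowed_edits(
    original: str, edits: List[Edit], is_allowed: Callable[[Edit], bool]
) -> str:
    """Apply only the edits that pass the veto; the rest keep the learner's wording."""
    tokens = original.split()
    # Work right to left so earlier token indices stay valid after each splice.
    for edit in sorted(edits, key=lambda e: (e.start, e.end), reverse=True):
        if is_allowed(edit):
            tokens[edit.start:edit.end] = edit.replacement.split()
        # A vetoed edit is skipped, which leaves the original tokens in place.
    return " ".join(tokens)


sentence = "I arrived to the station yesterday"
proposed = [
    Edit(2, 3, "at", "R:PREP"),      # necessary: wrong preposition
    Edit(4, 4, "train", "M:NOUN"),   # stylistic: adds detail the learner never wrote
]

# Toy veto rule (an assumption for this example): allow only preposition fixes.
result = apply_allowed_edits(sentence, proposed, lambda e: e.error_type == "R:PREP")
print(result)  # I arrived at the station yesterday
```

Because every edit is indexed against the original tokens, rejecting one proposal never disturbs the others, and a sentence whose edits are all vetoed comes back exactly as the learner wrote it.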

Results

On the CLC-FCE benchmark (Cambridge learner-English exam scripts), compared to a zero-shot GPT-4.1 baseline:

Why it matters

LEOPARD shows you don't need a specialized, expensively trained GEC model to get high precision; you can architecturally constrain a general-purpose LLM instead. Decoupling generative capability from a deterministic verification step is the key to recovering the precision needed to deploy trustworthy, voice-preserving AI tutors. Orchestration, not retraining, does the work.