JELV: A Judge of Edit-Level Validity for Evaluation and Automated Reference Expansion in Grammatical Error Correction

Yuhao Zhan, Yuqing Zhang, Jing Yuan, Qixiang Ma, Zhiqi Yang, Yu Gu, Zemin Liu, Fei Wu

TL;DR

This work tackles the limited reference diversity in Grammatical Error Correction (GEC) by introducing JELV, a scalable framework for automated edit-level validity assessment. It presents PEVData, a human-annotated benchmark, and implements two JELV variants: a high-accuracy LLM-based pipeline (JELV1.0) and a distilled DeBERTa classifier (JELV2.0) for efficiency. The authors demonstrate two main applications: (i) improving evaluation reliability through JELV-driven reclassification of false positives and fluency integration, and (ii) enabling automatic reference expansion via a generation-then-filtering pipeline that retrains top GEC systems on expanded data, yielding measurable gains. Together, JELV offers a scalable path to richer reference diversity, stronger evaluation alignment with human judgments, and improved model generalization in GEC.

Abstract

Existing Grammatical Error Correction (GEC) systems suffer from limited reference diversity, leading to underestimated evaluation scores and restricted model generalization. To address this issue, we introduce the Judge of Edit-Level Validity (JELV), an automated framework that validates correction edits in terms of grammaticality, faithfulness, and fluency. Using our proposed human-annotated Pair-wise Edit-level Validity Dataset (PEVData) as a benchmark, JELV offers two implementations: a multi-turn LLM-as-Judges pipeline achieving 90% agreement with human annotators, and a distilled DeBERTa classifier with 85% precision on valid edits. We then apply JELV to reclassify misjudged false positives in evaluation and derive a comprehensive evaluation metric by integrating false positive decoupling and fluency scoring, resulting in state-of-the-art correlation with human judgments. We also apply JELV to filter LLM-generated correction candidates, expanding BEA19's single-reference dataset, which contains 38,692 source sentences. Retraining top GEC systems on this expanded dataset yields measurable performance gains. JELV provides a scalable solution for enhancing reference diversity and strengthening both evaluation and model generalization.

Paper Structure

This paper contains 47 sections, 4 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Comparison between biased edit‐level evaluation using a single reference and our automated expansion of valid edits via the Judge of Edit‐Level Validity (JELV). Src., Hyp. and Ref. denote the source, hypothesis and reference sentences, respectively. We form two sentence pairs, $\mathbf{P_1}$ and $\mathbf{P_2}$. In each pair, $\mathbf{S_1}$ (from Src.) and $\mathbf{S_2}$ (from Hyp.) differ only in the single edited segment, remaining identical to the reference elsewhere.
  • Figure 2: Three criteria for judging edit-level validity.
  • Figure 3: Overview of the JELV workflow. Starting from a few reference sets, we extract candidate sentence pairs and process them in two independent streams. One stream is manually annotated by experts to create PEVData. The other is evaluated by a three-turn LLM-as-Judges pipeline (JELV1.0), and the resulting labels are distilled into a DeBERTa classifier (JELV2.0).
  • Figure 4: Ablation of JELV1.0 strategies on PEVData.
  • Figure 5: Overview of the comprehensive evaluation metric. Edits flagged as FPs are first reclassified via JELV2.0, then false positive decoupling distinguishes the remaining $\mathrm{FP_{noc}}$ and $\mathrm{FP_{oc}}$ to compute the edit‐level generalized F‐score $\mathrm{F_{\beta\text{-}G}}$. In parallel, GPT-2 perplexity produces the sentence‐level fluency score $f(x)$ for the hypothesis. An interpolation weight $\gamma$ combines these two streams into the final metric $F(x)$.
  • ...and 3 more figures
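
The evaluation flow summarized in the Figure 5 caption can be illustrated with a small Python sketch. It is not the authors' implementation, and several pieces are assumptions that do not come from this summary: the plain F_beta formula stands in for the generalized score F_{β-G} (the FP_noc / FP_oc decoupling is not reproduced here), the final score is assumed to be a linear interpolation with weight gamma, false negatives are left unchanged when a false positive is rescued, and the fluency score is taken as a given number in [0, 1] rather than computed from GPT-2 perplexity. The function names `f_beta`, `reclassify`, and `combine`, and the `is_valid` callback standing in for a JELV-style judge, are all hypothetical.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """Standard edit-level F_beta from counts (stand-in for F_{beta-G})."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def reclassify(tp, fp_edits, is_valid):
    """Move every FP edit that the judge accepts as valid into the TP count.

    `is_valid` is a hypothetical callback playing the role of the validity
    judge; it returns True when an edit should count as a valid correction.
    Returns the updated (tp, fp) counts.
    """
    rescued = sum(1 for edit in fp_edits if is_valid(edit))
    return tp + rescued, len(fp_edits) - rescued


def combine(f_edit, fluency, gamma):
    """Assumed linear interpolation of edit-level and fluency scores."""
    return (1 - gamma) * f_edit + gamma * fluency


if __name__ == "__main__":
    # Toy counts: 6 TPs, 4 edits flagged as FPs, 4 FNs.
    fp_edits = ["a", "b", "c", "d"]
    judged_valid = {"a", "b"}  # pretend the judge accepts two of them

    before = f_beta(6, len(fp_edits), 4)
    tp, fp = reclassify(6, fp_edits, lambda e: e in judged_valid)
    after = f_beta(tp, fp, 4)

    print(f"F0.5 before reclassification: {before:.3f}")
    print(f"F0.5 after reclassification:  {after:.3f}")
    print(f"combined with fluency 0.9, gamma 0.2: {combine(after, 0.9, 0.2):.3f}")
```

In this toy setting, rescuing two of the four flagged edits raises precision while recall moves only because the true-positive count grows, which mirrors the summary's point that single-reference evaluation underestimates systems producing valid but unlisted edits.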