Revisiting Meta-evaluation for Grammatical Error Correction

Masamune Kobayashi, Masato Mita, Mamoru Komachi

TL;DR

Revisiting Meta-evaluation for GEC tackles biases in English GEC metric meta-evaluation caused by granularity misalignment and reliance on outdated, classical systems. The authors introduce SEEDA, a two-granularity meta-evaluation dataset built from corrections by 12 neural systems (including LLMs) and two human references, enabling robust analysis of metric-human correlations. They show that aligning metric granularity with human evaluation improves sentence-level correlations, while correlations tend to drop when moving from classical to neural systems, revealing weaknesses in traditional metrics for fluent, multi-edit corrections. The work offers practical guidelines for valid, forward-looking meta-evaluation in GEC and better assessment of modern GEC models: use both edit-based and sentence-based metrics, analyze multiple system sets, and account for outliers and fluency references.

Abstract

Metrics are the foundation for automatic evaluation in grammatical error correction (GEC), and the evaluation of the metrics themselves (meta-evaluation) relies on their correlation with human judgments. However, conventional meta-evaluations in English GEC face several challenges, including biases caused by inconsistencies in evaluation granularity and an outdated setup using classical systems. These problems can lead to misinterpretation of metrics and potentially hinder the applicability of GEC techniques. To address these issues, this paper proposes SEEDA, a new dataset for GEC meta-evaluation. SEEDA consists of corrections with human ratings along two different granularities, edit-based and sentence-based, covering 12 state-of-the-art systems including large language models (LLMs), and two human corrections with different focuses. Aligning the granularity in the sentence-level meta-evaluation improves correlations, which suggests that edit-based metrics may have been underestimated in existing studies. Furthermore, correlations of most metrics decrease when changing from classical to neural systems, indicating that traditional metrics are relatively poor at evaluating fluently corrected sentences with many edits.

Paper Structure (46 sections, 5 figures, 8 tables)

This paper contains 46 sections, 5 figures, 8 tables.

Figures (5)

  • Figure 1: $M^{2}$ score ($F_{0.5}$) and word edit rate for classical systems in GJG15, neural systems in SEEDA, and human sentences. The neural systems make more edits and produce better corrections than the classical systems.
  • Figure 2: An overview of the annotation flow and an example of edit-based human evaluation. In Step 1, the annotator identifies errors in the source. Then, they categorize each edit in the output as either valid or not. In Step 2, the annotator determines whether each edit in the output effectively corrects the errors found in Step 1. TP, FP, and FN represent True Positive, False Positive, and False Negative, respectively.
  • Figure 3: Scatter plots of the human score and the metric score. "Base" indicates the 12 systems excluding uncorrected sentences (INPUT) and fluent sentences (REF-F, GPT-3.5). Each line represents a regression line, and the shaded area indicates the size of the confidence interval for the estimated regression, obtained using bootstrap. Comparing the orange and blue regression lines to the gray regression line allows us to observe the degree of influence of each outlier on the distribution trend. For example, the leftward tilt of the orange regression lines for $M^{2}$, PT-$M^{2}$, ERRANT, and GLEU indicates a negative impact from fluent sentences as outliers.
  • Figure 4: Variation of correlation when different systems are considered using window analysis. The x-axis represents the human ranking of the 12 systems excluding outliers. $n$ denotes the number of systems considered, with solid lines representing four systems and dashed lines representing eight systems. For example, for $n=4$, a point with $x=5$ corresponds to a human evaluation using systems ranked 2 to 5. The orange line represents Pearson ($r$) and the blue line represents Kendall ($\tau$). The correlation of the main metrics ($M^{2}$, ERRANT, GLEU) shows significant variability, while pretraining-based metrics (SOME, IMPARA) exhibit relatively stable correlations.
  • Figure 5: Screenshot of doccano used in the edit-based human evaluation.
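
The window analysis described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`pearson`, `kendall_tau`, `window_correlations`) and the toy scores are hypothetical, the Kendall variant shown is tau-a (the caption does not say which variant the paper uses), and the code assumes no window contains constant scores.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation r between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)  # assumes neither list is constant


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total pairs; ties count as neither."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


def window_correlations(human, metric, n):
    """Slide a window of n systems down the human ranking (best first).

    Returns (x, r, tau) tuples, where x is the human rank of the
    lowest-ranked system in the window, so n=4 and x=5 covers ranks 2 to 5.
    """
    order = sorted(range(len(human)), key=lambda i: human[i], reverse=True)
    results = []
    for start in range(len(order) - n + 1):
        idx = order[start:start + n]
        h = [human[i] for i in idx]
        m = [metric[i] for i in idx]
        results.append((start + n, pearson(h, m), kendall_tau(h, m)))
    return results


if __name__ == "__main__":
    # Hypothetical system-level scores for six systems (not data from the paper).
    human_scores = [0.9, 0.7, 0.6, 0.4, 0.2, 0.1]
    metric_scores = [0.80, 0.75, 0.50, 0.55, 0.30, 0.10]
    for x, r, tau in window_correlations(human_scores, metric_scores, n=4):
        print(f"x={x}: r={r:.3f}, tau={tau:.3f}")
```

A metric whose correlations stay roughly flat as the window moves is behaving consistently across system subsets, while large swings between windows correspond to the variability the caption reports for $M^{2}$, ERRANT, and GLEU.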