Revisiting Meta-evaluation for Grammatical Error Correction
Masamune Kobayashi, Masato Mita, Mamoru Komachi
TL;DR
This paper addresses biases in English GEC metric meta-evaluation caused by misaligned evaluation granularity and reliance on outdated classical systems. The authors introduce SEEDA, a meta-evaluation dataset with human ratings at two granularities (edit-based and sentence-based), built from corrections by 12 neural systems (including LLMs) and two human references, enabling robust analysis of metric-human correlations. They show that aligning metric granularity with human evaluation improves sentence-level correlations, and that correlations of most metrics drop when moving from classical to neural systems, revealing weaknesses in traditional metrics for fluent, multi-edit corrections. The work offers practical guidelines (use both edit-based and sentence-based metrics, analyze multiple system sets, and consider outliers and fluency references) for valid, forward-looking meta-evaluation of modern GEC models.
Abstract
Metrics are the foundation for automatic evaluation in grammatical error correction (GEC), and the evaluation of the metrics themselves (meta-evaluation) relies on their correlation with human judgments. However, conventional meta-evaluations in English GEC encounter several challenges, including biases caused by inconsistencies in evaluation granularity and an outdated setup using classical systems. These problems can lead to misinterpretation of metrics and potentially hinder the applicability of GEC techniques. To address these issues, this paper proposes SEEDA, a new dataset for GEC meta-evaluation. SEEDA consists of corrections with human ratings along two different granularities, edit-based and sentence-based, covering 12 state-of-the-art systems including large language models (LLMs), and two human corrections with different focuses. Correlations improve when the granularity is aligned in the sentence-level meta-evaluation, suggesting that edit-based metrics may have been underestimated in existing studies. Furthermore, correlations of most metrics decrease when changing from classical to neural systems, indicating that traditional metrics are relatively poor at evaluating fluently corrected sentences with many edits.
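
To make the notion of meta-evaluation concrete, the following is a minimal sketch of a system-level correlation between a metric and human judgments. It is not the authors' code: the function names and the toy scores are hypothetical, and the choice of Pearson and Spearman coefficients is an assumption about a typical setup rather than something stated in the abstract.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical toy data: one score per GEC system (not from SEEDA).
human_scores = [0.62, 0.48, 0.71, 0.55, 0.30]
metric_scores = [0.58, 0.50, 0.66, 0.61, 0.35]

print("Pearson :", pearson(human_scores, metric_scores))
print("Spearman:", spearman(human_scores, metric_scores))
```

A metric whose scores rank systems in the same order as the human ratings obtains a Spearman coefficient close to 1. Repeating the computation on different sets of systems (for example, classical versus neural) is one way to observe the kind of correlation shift the abstract describes.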
