CLEME2.0: Towards Interpretable Evaluation by Disentangling Edits for Grammatical Error Correction

Jingheng Ye; Zishan Xu; Yinghui Li; Linlin Song; Qingyu Zhou; Hai-Tao Zheng; Ying Shen; Wenhao Jiang; Hong-Gee Kim; Ruitong Liu; Xin Su; Zifei Shan

TL;DR

CLEME2.0 introduces an interpretable, reference-based GEC evaluation framework that decomposes edits into four disentangled aspects (hit-correction, wrong-correction, under-correction, and over-correction) and aggregates them with adjustable weights. It extracts edits by chunk partition and offers two edit-weighting strategies, similarity-based and LLM-based, to better capture the semantic impact of each edit. The aggregate is formalized as $\text{Score} = \alpha_1 \cdot \text{Hit} + \alpha_2 \cdot (1 - \text{Wrong}) + \alpha_3 \cdot (1 - \text{Under}) + \alpha_4 \cdot (1 - \text{Over})$. Across two human judgment datasets (GJG15, SEEDA) and six reference sets, CLEME2.0 achieves state-of-the-art correlations with human judgments and is robust to different annotation styles, with similarity-based weighting generally outperforming LLM-based weighting. The work provides actionable diagnostics for GEC system development and highlights the importance of semantic weighting of edits over conventional PRF-based metrics.
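The aggregation formula above can be illustrated with a minimal Python sketch. It is not taken from the released CLEME code: the names `DisentangledScores` and `comprehensive_score` are hypothetical, the uniform default weights are placeholders rather than values reported in the paper, and the four inputs are assumed to already be normalized to the range [0, 1].

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class DisentangledScores:
    """Hypothetical container for the four aspect scores, each assumed in [0, 1]."""
    hit: float    # hit-correction: higher is better
    wrong: float  # wrong-correction: lower is better
    under: float  # under-correction: lower is better
    over: float   # over-correction: lower is better


def comprehensive_score(scores, alphas=(0.25, 0.25, 0.25, 0.25)):
    """Score = a1*Hit + a2*(1 - Wrong) + a3*(1 - Under) + a4*(1 - Over).

    The uniform default weights are an assumption made for illustration only.
    """
    if len(alphas) != 4:
        raise ValueError("expected exactly four weights")
    if any(not 0.0 <= value <= 1.0 for value in astuple(scores)):
        raise ValueError("each aspect score is assumed to lie in [0, 1]")
    a1, a2, a3, a4 = alphas
    return (
        a1 * scores.hit
        + a2 * (1.0 - scores.wrong)
        + a3 * (1.0 - scores.under)
        + a4 * (1.0 - scores.over)
    )


if __name__ == "__main__":
    # Made-up aspect scores for a single hypothetical system.
    example = DisentangledScores(hit=0.6, wrong=0.1, under=0.3, over=0.05)
    print(round(comprehensive_score(example), 4))
```

Because the three error-type aspects enter as complements, a higher aggregate always means a better system, and changing the weights shifts the emphasis between rewarding correct edits and penalizing each kind of mistake.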

Abstract

The paper focuses on the interpretability of Grammatical Error Correction (GEC) evaluation metrics, which has received little attention in previous studies. To bridge the gap, we introduce **CLEME2.0**, a reference-based metric describing four fundamental aspects of GEC systems: hit-correction, wrong-correction, under-correction, and over-correction. Together, these aspects expose critical qualities of GEC systems and help locate their drawbacks. Evaluating systems by combining these aspects also yields higher consistency with human judgments than other reference-based and reference-less metrics. Extensive experiments on two human judgment datasets and six reference datasets demonstrate the effectiveness and robustness of our method, which achieves a new state-of-the-art result. Our code is released at https://github.com/THUKElab/CLEME.

Paper Structure (51 sections, 8 equations, 3 figures, 12 tables)

Introduction
Related Work
Reference-based metrics.
Reference-less metrics.
Method
Edit Extraction
Disentangled Scores
Hit-correction score.
Wrong-correction score.
Under-correction score.
Over-correction score.
Comprehensive Score
Edit Weighting
Similarity-based weighting.
LLM-based weighting.
...and 36 more sections

Figures (3)

Figure 1: An example of CLEME2.0. We highlight TP, FP, FP$_{\text{ne}}$, FP$_{\text{un}}$, and FN in different colors.
Figure 2: Overview of CLEME2.0. Initially, we extract edits and categorize hypothesis edits as TP, FN, FP$_{\text{ne}}$, and FP$_{\text{un}}$. Next, we compute four distinct scores. Finally, we integrate these scores into an overall score utilizing one of the edit weighting techniques.
Figure 3: Prompt of LLM-based weighting.
