Grammatical Error Correction Evaluation by Optimally Transporting Edit Representation

Takumi Goto; Yusuke Sakai; Taro Watanabe

TL;DR

This paper tackles the core challenge of evaluating grammatical error correction (GEC) systems by moving beyond token-level embedding similarity to edit-level similarity. It introduces edit vectors that encode the semantic impact of individual edits and uses unbalanced optimal transport (UOT) to transport these vectors from a hypothesis to a reference, producing an interpretable soft edit alignment. The proposed UOT-ERRANT metric derives precision, recall, and $F_{\beta}$ from the transport plan, allowing robust system ranking across diverse reference sets and improving correlations with human judgments, particularly in fluency-heavy domains. Although computationally heavier, the method offers valuable analytical capabilities for debugging and understanding GEC systems and is validated on SEEDA and GMEG-Data meta-evaluation tasks with strong performance in +Fluency scenarios. The approach also invites broader applications of edit vectors to other editing tasks and evaluation contexts.

Abstract

Automatic evaluation in grammatical error correction (GEC) is crucial for selecting the best-performing systems. Reference-based metrics, which measure the similarity between hypothesis and reference sentences, are currently a popular choice. However, embedding-based similarity measures such as BERTScore are often ineffective, because many words in the source sentence remain unchanged in both the hypothesis and the reference. This study instead focuses on edits of the kind designed specifically for GEC, i.e., those extracted by ERRANT, and computes similarity over the edits made to the source sentence. To this end, we propose the edit vector, a representation of an edit, and introduce a new metric, UOT-ERRANT, which transports these edit vectors from the hypothesis to the reference using unbalanced optimal transport. Experiments with the SEEDA meta-evaluation show that UOT-ERRANT improves evaluation performance, particularly in the +Fluency domain, where many edits occur. Moreover, our method is highly interpretable because the transport plan can be read as a soft edit alignment, making UOT-ERRANT useful for both ranking and analyzing GEC systems. Our code is available at https://github.com/gotutiyan/uot-errant.
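The pipeline described above (edit vectors, an unbalanced transport plan, then precision, recall, and $F_{\beta}$) can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation, and nearly every concrete choice in it is an assumption made for illustration: the edit vectors are toy 2-D inputs rather than model embeddings, the cost is Euclidean distance, each edit carries unit mass, the solver is the entropic, KL-relaxed Sinkhorn scaling iteration with hypothetical parameters `eps` and `tau`, and the transported mass is treated as a soft true-positive count and clamped to 1 when computing precision and recall. The function names `unbalanced_sinkhorn` and `soft_prf` are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def unbalanced_sinkhorn(cost, a, b, eps=0.1, tau=1.0, n_iter=500):
    """Entropic UOT with KL-relaxed marginals (Sinkhorn-style scaling).

    cost: n x m cost matrix; a, b: source and target masses.
    eps and tau are illustrative values, not settings from the paper.
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    power = tau / (tau + eps)  # exponent < 1 softens the marginal constraints
    for _ in range(n_iter):
        for i in range(n):
            kv = sum(kernel[i][j] * v[j] for j in range(m))
            u[i] = (a[i] / max(kv, 1e-300)) ** power
        for j in range(m):
            ktu = sum(kernel[i][j] * u[i] for i in range(n))
            v[j] = (b[j] / max(ktu, 1e-300)) ** power
    # Transport plan: plan[i][j] is the mass moved from edit i to edit j.
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def soft_prf(hyp_vecs, ref_vecs, beta=0.5):
    """Soft precision, recall, and F_beta from a UOT plan over edit vectors.

    Assumption: total transported mass acts as a soft true-positive count.
    """
    if not hyp_vecs and not ref_vecs:
        return 1.0, 1.0, 1.0, []  # nothing to correct, nothing corrected
    if not hyp_vecs or not ref_vecs:
        return 0.0, 0.0, 0.0, []
    cost = [[euclidean(h, r) for r in ref_vecs] for h in hyp_vecs]
    a = [1.0] * len(hyp_vecs)  # unit mass per hypothesis edit (assumption)
    b = [1.0] * len(ref_vecs)  # unit mass per reference edit (assumption)
    plan = unbalanced_sinkhorn(cost, a, b)
    moved = sum(sum(row) for row in plan)
    precision = min(1.0, moved / sum(a))
    recall = min(1.0, moved / sum(b))
    denom = beta ** 2 * precision + recall
    f_beta = (1 + beta ** 2) * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f_beta, plan


if __name__ == "__main__":
    reference_edits = [[1.0, 0.0], [0.0, 1.0]]
    # The hypothesis reproduces both reference edits and adds an unrelated one.
    hypothesis_edits = [[1.0, 0.0], [0.0, 1.0], [-3.0, -3.0]]
    p, r, f, plan = soft_prf(hypothesis_edits, reference_edits)
    print(f"precision={p:.3f} recall={r:.3f} F0.5={f:.3f}")
    for row in plan:
        print(["%.3f" % x for x in row])
```

In this toy setting the unrelated hypothesis edit moves almost no mass, so it lowers precision while leaving recall nearly intact, and the rows of `plan` can be read as a soft alignment between hypothesis and reference edits.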

Paper Structure (47 sections, 13 equations, 9 figures, 3 tables)

Introduction
Related Work
Edit-level GEC Evaluation
Quantifying Edits
NLP Applications of Optimal Transport
Proposed Methods
Edit Vector
Edit-level Similarity via Optimal Transport
Discrete Optimal Transport
Balanced Optimal Transport (BOT)
Unbalanced Optimal Transport (UOT)
Connection between UOT and GEC
Proposed Metric: UOT-ERRANT
Experiments
Meta-evaluation Settings
...and 32 more sections

Figures (9)

Figure 1: An overview of the proposed metric, UOT-ERRANT. Edits are extracted from the hypothesis and the reference, respectively, and converted into edit vectors. The optimal transport plan of edit vectors from the hypothesis to the reference vectors is then decomposed into a precision, recall, and $F_{0.5}$ score.
Figure 2: System-wise average number of edits per sentence in the Wiki domain of GMEG-Data (orange) and in SEEDA (blue). For each dataset, the bars are sorted by height.
Figure 3: Scatter plot for 14 systems on SEEDA-E +Fluency. The $x$-axis and $y$-axis represent human and metric scores, respectively.
Figure 4: A case study: Alignment between four hypothesis edits ($y$-axis) and three reference edits ($x$-axis). The actual sentences are as follows: Source: "It is still early for parents to decide whether they can foster a new life that are not able to work and may suffer the pain in the entire life ." Reference: "$\dots$ new life that is not able to work and may suffer their entire life ." Hypothesis: "$\dots$ new life that is not able to work and may suffer pain throughout their life ."
Figure 5: Visualization of edit vectors with dimensionality reduced by t-SNE. The plot is color-coded according to the error type.
...and 4 more figures
