gec-metrics: A Unified Library for Grammatical Error Correction Evaluation

Takumi Goto, Yusuke Sakai, Taro Watanabe

TL;DR

gec-metrics presents a unified, API-first library for grammatical error correction evaluation, addressing fragmentation and reproducibility in existing metrics by consolidating ten metrics and two meta-evaluation frameworks under a common interface. The framework supports CLI, Python API, and visualization tools, facilitating fair comparisons, metric development, and meta-evaluation studies, with emphasis on transparency and reproducibility. Extensive experiments, including LLM-based metrics and metric ensembling, demonstrate how unified interfaces enable robust analysis and reveal context-dependent correlations with human judgments. By providing extensible abstractions, reproducible configurations, and accessible visualization, gec-metrics aims to accelerate reliable GEC evaluation and broad community adoption.

Abstract

We introduce gec-metrics, a library for using and developing grammatical error correction (GEC) evaluation metrics through a unified interface. Our library enables fair system comparisons by ensuring that everyone conducts evaluations using a consistent implementation. Moreover, it is designed with a strong focus on API usage, making it highly extensible. It also includes meta-evaluation functionalities and provides analysis and visualization scripts, contributing to developing GEC evaluation metrics. Our code is released under the MIT license and is also distributed as an installable package. The video is available on YouTube.

Paper Structure

This paper contains 34 sections, 6 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: System overview of gec-metrics. The sources are sentences containing grammatical errors, the hypotheses are their corrected versions, and the references are human-corrected sentences. Metric classes support both corpus-level and sentence-level evaluation. The MetaEval classes conduct meta-evaluation of metrics by calculating correlations with human evaluation. These classes also provide analysis and visualization scripts, which are especially useful for developers.
  • Figure 2: Examples of input/output for GEC evaluation.
  • Figure 3: Categories of current GEC metrics. Edit-level metrics consider the overlap of edits. $n$-gram-level metrics categorize $n$-grams into seven groups and use the $n$-gram counts for each group. Sentence-level metrics employ neural models and estimate scores without references.
  • Figure 4: Window-analysis results for IMPARA. The x-axis indicates the start rank in the human evaluation, and the y-axis shows the Pearson (blue line) or Spearman (orange line) correlation.
  • Figure 5: IMPARA
  • ...and 4 more figures
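
The window analysis described in the Figure 4 caption can be illustrated with a minimal sketch. It assumes that systems are sorted by human score, that a fixed-size window slides over the ranking, and that Pearson and Spearman correlations between human and metric scores are computed inside each window. The function names, the window size of 4, and the toy scores are all hypothetical; this is not the gec-metrics API and the numbers are not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def window_analysis(human_scores, metric_scores, window=4):
    """Return (start_rank, pearson, spearman) for each window of systems.

    Systems are ordered from best to worst by human score, so start_rank 1
    refers to the window that begins at the top-ranked system.
    """
    order = sorted(range(len(human_scores)), key=lambda i: -human_scores[i])
    human = [human_scores[i] for i in order]
    metric = [metric_scores[i] for i in order]
    results = []
    for start in range(len(order) - window + 1):
        h = human[start:start + window]
        m = metric[start:start + window]
        results.append((start + 1, pearson(h, m), spearman(h, m)))
    return results


# Made-up scores for six hypothetical systems (illustration only).
toy_human = [0.9, 0.7, 0.5, 0.3, 0.1, -0.2]
toy_metric = [0.80, 0.82, 0.60, 0.55, 0.40, 0.10]
toy_results = window_analysis(toy_human, toy_metric, window=4)
```

Plotting the start rank against the two correlation columns of `toy_results` would give a curve of the kind the caption describes, showing whether a metric agrees with human judgments equally well among top-ranked and lower-ranked systems.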