Pillars of Grammatical Error Correction: Comprehensive Inspection Of Contemporary Approaches In The Era of Large Language Models

Kostiantyn Omelianchuk; Andrii Liubonko; Oleksandr Skurzhanskyi; Artem Chernodub; Oleksandr Korniienko; Igor Samokhin

TL;DR

This paper compares single-model approaches to Grammatical Error Correction (GEC) across large language models (LLMs), sequence-to-sequence models, and edit-based systems, and systematically studies ensembling and ranking strategies for combining them. It shows that simple majority voting over the outputs of diverse single-model systems can match or exceed the prior state of the art, and that second-order ensembling (ensembles of ensembles) yields further gains, culminating in $F_{0.5}$ scores of $72.8$ on CoNLL-2014-test and $81.4$ on BEA-test. The authors also evaluate LLMs in zero-shot and fine-tuned settings and show that GPT-4 can serve as a competitive ranking component within ensembles, although it leans toward recall. The results point to data quality and ensemble diversity as the key bottlenecks and opportunities, suggesting that future progress will depend more on data and system-combination strategies than on model scale alone. Code, trained models, and system outputs are released for reproducibility.

Abstract

In this paper, we carry out experimental research on Grammatical Error Correction (GEC), delving into the nuances of single-model systems, comparing the efficiency of ensembling and ranking methods, and exploring the application of large language models to GEC as single-model systems, as parts of ensembles, and as ranking methods. We set new state-of-the-art performance with $F_{0.5}$ scores of $72.8$ on CoNLL-2014-test and $81.4$ on BEA-test. To support further advancements in GEC and ensure the reproducibility of our research, we make our code, trained models, and systems' outputs publicly available.
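
For reference, the $F_{0.5}$ score quoted in the abstract is the standard $F_\beta$ measure with $\beta = 0.5$, which weights precision more heavily than recall. Writing $P$ for edit-level precision and $R$ for edit-level recall (symbols introduced here for illustration, not taken from the paper), it can be stated as:

$$
F_\beta = \frac{(1 + \beta^2)\, P\, R}{\beta^2 P + R},
\qquad
F_{0.5} = \frac{1.25\, P\, R}{0.25\, P + R}.
$$

Because $\beta < 1$, a system that proposes fewer but more reliable edits is rewarded over one that proposes many edits with lower precision.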

Paper Structure (22 sections, 1 equation, 2 figures, 12 tables)

Introduction
Data for Training and Evaluation
Single-Model Systems
Large Language Models
Zero-Shot Prompting
Fine-tuning the Large Language Models
Sequence-to-Sequence models
Edit-based Systems
Single-Model Systems Results
Ensembling and Ranking of Single-Model Systems
Oracle-Ensembling and Oracle-Ranking as Upper-Bound Baselines
Ensembling by Majority Votes on Edit Spans (Unsupervised)
Ensembling and Ranking by GRECO Model (Supervised Quality Estimation)
Ranking by GPT-4 (Zero-Shot)
Ensembles of Ensembles
...and 7 more sections

Figures (2)

Figure 1: Combining the single-model systems’ outputs. Left: In ensembling, candidates (system outputs) are aggregated on an edit level. Right: In ranking, candidates (system outputs) are aggregated on a sentence level. We consider ranking to be a special case of ensembling.
Figure 2: Dendrogram of hierarchical clustering analysis for single-model systems. The y-axis represents the distance metric used for clustering, with a red dashed line indicating the selected threshold for cluster formation ($t = 0.11$). The x-axis enumerates different systems that were analyzed. The dendrogram branches reflect the hierarchical grouping based on the proximity of distance metrics.
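
The edit-level aggregation described in the Figure 1 caption can be illustrated with a small majority-voting sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the edit representation `(start, end, replacement)`, the function names `majority_vote` and `apply_edits`, and the `min_votes` threshold are all hypothetical choices made for this example.

```python
from collections import Counter
from typing import List, Tuple

# Assumed edit format: (start_token, end_token, replacement_tokens).
Edit = Tuple[int, int, Tuple[str, ...]]


def majority_vote(system_edits: List[List[Edit]], min_votes: int) -> List[Edit]:
    """Keep every edit proposed by at least `min_votes` systems."""
    votes: Counter = Counter()
    for edits in system_edits:
        votes.update(set(edits))  # each system votes at most once per edit
    return sorted(e for e, n in votes.items() if n >= min_votes)


def apply_edits(tokens: List[str], edits: List[Edit]) -> List[str]:
    """Apply non-overlapping edits right to left so earlier offsets stay valid."""
    out = list(tokens)
    for start, end, repl in sorted(edits, reverse=True):
        out[start:end] = repl
    return out


source = "She go to school yesterday .".split()
# Hypothetical outputs of three single-model systems, already converted to edits.
system_edits = [
    [(1, 2, ("went",))],
    [(1, 2, ("went",)), (3, 3, ("the",))],
    [(1, 2, ("goes",))],
]
kept = majority_vote(system_edits, min_votes=2)
corrected = " ".join(apply_edits(source, kept))
print(corrected)
```

Only the edit supported by two systems survives the vote. The sketch does not resolve conflicting edits that both pass the threshold on the same span; a practical system would need a tie-breaking rule. Ranking, by contrast, would select one complete candidate sentence instead of merging individual edits.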
