Pillars of Grammatical Error Correction: Comprehensive Inspection Of Contemporary Approaches In The Era of Large Language Models
Kostiantyn Omelianchuk, Andrii Liubonko, Oleksandr Skurzhanskyi, Artem Chernodub, Oleksandr Korniienko, Igor Samokhin
TL;DR
This paper addresses the challenge of achieving robust Grammatical Error Correction (GEC) by comprehensively comparing single-model approaches across LLMs, Seq2Seq, and edit-based systems, and by systematically studying ensembling and ranking strategies. It follows open-science practices by releasing code, trained models, and system outputs, and demonstrates that simple majority voting among diverse single-model outputs can match or exceed the prior state of the art. Second-order ensembling, which combines multiple ensembling methods, yields further gains, culminating in $F_{0.5}$ scores of $72.8$ on CoNLL-2014-test and $81.4$ on BEA-test. The authors also explore large language models in zero-shot and fine-tuned settings, and show that GPT-4 can serve as a competitive ranking component within ensembles, albeit with recall-leaning tendencies. The results point to data quality and ensemble diversity as both key bottlenecks and opportunities, suggesting that future progress will rely more on data and system combination strategies than on model scale alone, while providing reusable resources for reproducibility.
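To make the majority-voting idea concrete, here is a minimal sketch of edit-level voting. It is an illustration under stated assumptions, not the authors' implementation: the `(start, end, replacement)` token-span edit format and the names `majority_vote` and `apply_edits` are hypothetical, and the paper's actual pipeline may extract, count, and apply edits differently.

```python
from collections import Counter

# Hypothetical edit format (an assumption, not taken from the paper):
# (start_token, end_token, replacement) over the tokenized source sentence.
Edit = tuple[int, int, str]


def majority_vote(edit_sets: list[list[Edit]], min_votes: int) -> list[Edit]:
    """Keep every edit proposed by at least `min_votes` systems."""
    # set(...) ensures one system cannot vote twice for the same edit.
    counts = Counter(edit for edits in edit_sets for edit in set(edits))
    return sorted(edit for edit, votes in counts.items() if votes >= min_votes)


def apply_edits(tokens: list[str], edits: list[Edit]) -> list[str]:
    """Apply edits right to left so earlier token indices stay valid."""
    out = list(tokens)
    limit = len(tokens)  # leftmost position already modified
    for start, end, replacement in sorted(edits, reverse=True):
        if end > limit:
            continue  # skip an edit that overlaps one already applied
        out[start:end] = replacement.split()
        limit = start
    return out


tokens = "She go to school yesterday .".split()
systems = [
    [(1, 2, "went")],                  # system A
    [(1, 2, "went"), (3, 3, "the")],   # system B
    [(1, 2, "goes")],                  # system C
]
kept = majority_vote(systems, min_votes=2)
print(" ".join(apply_edits(tokens, kept)))  # She went to school yesterday .
```

Raising `min_votes` keeps only edits that more systems agree on, which tends to favor precision over recall; lowering it does the opposite.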
Abstract
In this paper, we carry out experimental research on Grammatical Error Correction (GEC), delving into the nuances of single-model systems, comparing the efficiency of ensembling and ranking methods, and exploring the application of large language models to GEC as single-model systems, as parts of ensembles, and as ranking methods. We set new state-of-the-art performance with $F_{0.5}$ scores of $72.8$ on CoNLL-2014-test and $81.4$ on BEA-test. To support further advancements in GEC and ensure the reproducibility of our research, we make our code, trained models, and systems' outputs publicly available.
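For readers unfamiliar with the metric, the standard definition of the $F_{\beta}$ score is given below, with $P$ denoting precision and $R$ denoting recall over proposed edits. This is the general textbook formula, not a detail taken from this paper's evaluation setup:

$$F_{\beta} = \frac{(1 + \beta^{2})\, P\, R}{\beta^{2} P + R}, \qquad F_{0.5} = \frac{1.25\, P\, R}{0.25\, P + R}$$

With $\beta = 0.5$, precision is weighted more heavily than recall, so a system that proposes many extra edits to gain recall is penalized if those edits lower precision.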
