GPT-3.5 for Grammatical Error Correction
Anisia Katinskaia, Roman Yangarber
TL;DR
The paper assesses GPT-3.5 for Grammatical Error Correction (GEC) across multiple languages in three settings: zero-shot, fine-tuning, and as a re-ranker of other GEC hypotheses. It employs both reference-based and reference-free evaluations, including LM-based critics, Scribendi scores, and semantic similarity, complemented by manual annotations for English and Russian. Findings show high recall and fluent corrections in English, but a tendency to over-edit and substantial semantic drift in several other languages. Re-ranking consistently improves recall and can outperform zero-shot GPT-3.5 in English and Russian, suggesting practical integration as a re-ranker in GEC pipelines. The work also highlights limitations of traditional reference-based metrics for modern LLMs and calls for richer, multi-reference evaluation and broader linguistic coverage in future research.
Abstract
This paper investigates the application of GPT-3.5 for Grammatical Error Correction (GEC) in multiple languages in several settings: zero-shot GEC, fine-tuning for GEC, and using GPT-3.5 to re-rank correction hypotheses generated by other GEC models. In the zero-shot setting, we conduct automatic evaluations of the corrections proposed by GPT-3.5 using several methods: estimating grammaticality with language models (LMs), the Scribendi test, and comparing the semantic embeddings of sentences. GPT-3.5 has a known tendency to over-correct erroneous sentences and propose alternative corrections. For several languages, such as Czech, German, Russian, Spanish, and Ukrainian, GPT-3.5 substantially alters the source sentences, including their semantics, which presents significant challenges for evaluation with reference-based metrics. For English, GPT-3.5 demonstrates high recall, generates fluent corrections, and generally preserves sentence semantics. However, human evaluation for both English and Russian reveals that, despite its strong error-detection capabilities, GPT-3.5 struggles with several error types, including punctuation mistakes, tense errors, syntactic dependencies between words, and lexical compatibility at the sentence level.
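The embedding-based check mentioned above can be illustrated with a minimal sketch: embed the source sentence and the proposed correction, then flag the pair when their cosine similarity falls below a threshold. This is not the authors' implementation. The names `toy_embed`, `cosine`, and `semantic_drift` are hypothetical, the threshold of 0.5 is an arbitrary illustrative value, and the bag-of-words embedding is only a stand-in for whatever sentence encoder would be used in practice.

```python
import math
from collections import Counter
from typing import Callable, Dict

Vector = Dict[str, float]


def toy_embed(sentence: str) -> Vector:
    """Hypothetical stand-in for a sentence encoder: lowercased bag-of-words counts."""
    return {token: float(count) for token, count in Counter(sentence.lower().split()).items()}


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_drift(
    source: str,
    correction: str,
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = 0.5,  # assumed value, not taken from the paper
) -> bool:
    """Return True when the correction looks semantically distant from the source."""
    return cosine(embed(source), embed(correction)) < threshold


if __name__ == "__main__":
    src = "she go to school yesterday"
    minimal_fix = "she went to school yesterday"
    rewrite = "the weather was nice all week"
    print(semantic_drift(src, minimal_fix))  # minimal edit, high lexical overlap
    print(semantic_drift(src, rewrite))      # unrelated rewrite, no overlap
```

Passing a real sentence encoder as `embed` would make the similarity sensitive to meaning rather than word overlap; the toy version only keeps the example self-contained.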
