Improving Grammatical Error Correction via Contextual Data Augmentation
Yixuan Wang, Baoxin Wang, Yijun Liu, Qingfu Zhu, Dayong Wu, Wanxiang Che
TL;DR
This work tackles data scarcity and noisy labels in grammatical error correction (GEC) by introducing Contextual Data Augmentation (CDA), which couples rule-based error-pattern substitution with model-generated contexts to produce context-rich synthetic data whose error distribution matches that of real data. A relabeling-based denoising step mitigates label noise, and a three-stage training pipeline integrates the synthetic data into fine-tuning. Empirical results on CoNLL14 and BEA19-Test demonstrate state-of-the-art performance with a relatively modest amount of synthetic data, highlighting improvements in robustness and error-type coverage, particularly for low-frequency errors. The approach offers a practical path to leveraging synthetic data for GEC in data-limited settings and lays a foundation for further improvements with larger language models.
Abstract
Data augmentation with synthetic data is widely used in Grammatical Error Correction (GEC) to alleviate data scarcity. However, synthetic data is mainly used in the pre-training phase rather than the data-limited fine-tuning phase, because its error distribution is inconsistent and its labels are noisy. In this paper, we propose a synthetic data construction method based on contextual augmentation, which efficiently augments the original data while keeping the error distribution more consistent. Specifically, we combine rule-based substitution with model-based generation, using a generative model to produce richer contexts for the extracted error patterns. We also propose a relabeling-based data cleaning method to mitigate the effect of noisy labels in synthetic data. Experiments on CoNLL14 and BEA19-Test show that the proposed augmentation method consistently and substantially outperforms strong baselines and achieves state-of-the-art results with only a small amount of synthetic data.
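
To make the idea of combining extracted error patterns with new contexts concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes that an error pattern is a simple token-level substitution found by diffing an erroneous sentence against its correction, and it uses a hard-coded sentence as a stand-in for the context that a generative model would produce. The function names `extract_patterns` and `inject_error` are hypothetical, and the relabeling-based cleaning step is not covered.

```python
from difflib import SequenceMatcher


def extract_patterns(source, target):
    """Return (erroneous, corrected) token-span pairs from one sentence pair.

    Assumption: only token substitutions are treated as error patterns;
    insertions and deletions are ignored to keep the sketch short.
    """
    src, tgt = source.split(), target.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    patterns = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            patterns.append((" ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
    return patterns


def inject_error(context, pattern):
    """Rule-based substitution: place the erroneous span into a clean context.

    Returns a synthetic (erroneous, corrected) sentence pair, or None when
    the corrected span does not occur in the context.
    """
    wrong, right = pattern
    tokens, right_tokens = context.split(), right.split()
    n = len(right_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == right_tokens:
            corrupted = tokens[:i] + wrong.split() + tokens[i + n:]
            return " ".join(corrupted), context
    return None


# Pattern taken from an annotated pair; the context below is hand-written
# here and merely stands in for model-generated text (an assumption).
pattern = extract_patterns("She go to school .", "She goes to school .")[0]
context = "My brother goes to the gym every day ."
pair = inject_error(context, pattern)
print(pair)
```

In this sketch the extracted pattern is reused in a sentence unrelated to the one it came from, which illustrates how the same error can be placed in a richer variety of contexts while the set of error patterns stays tied to the original data.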
