
Improving Grammatical Error Correction via Contextual Data Augmentation

Yixuan Wang, Baoxin Wang, Yijun Liu, Qingfu Zhu, Dayong Wu, Wanxiang Che

TL;DR

This work tackles data scarcity and noisy labels in grammatical error correction by introducing Contextual Data Augmentation (CDA), which couples rule-based error-pattern substitution with model-generated contexts to produce context-rich synthetic data that match real error distributions. A relabeling-based denoising step mitigates label noise, and a three-stage training pipeline effectively integrates synthetic data into fine-tuning. Empirical results on CoNLL14 and BEA19 demonstrate state-of-the-art performance with a relatively modest amount of synthetic data, highlighting improvements in robustness and error-type coverage, particularly for low-frequency errors. The approach offers a practical path to leveraging synthetic data for GEC in data-limited settings and sets a foundation for further improvements with larger language models.

Abstract

Nowadays, data augmentation through synthetic data has been widely used in the field of Grammatical Error Correction (GEC) to alleviate the problem of data scarcity. However, these synthetic data are mainly used in the pre-training phase rather than the data-limited fine-tuning phase due to inconsistent error distributions and noisy labels. In this paper, we propose a synthetic data construction method based on contextual augmentation, which can ensure an efficient augmentation of the original data with a more consistent error distribution. Specifically, we combine rule-based substitution with model-based generation, using the generative model to generate a richer context for the extracted error patterns. In addition, we propose a relabeling-based data cleaning method to mitigate the effects of noisy labels in synthetic data. Experiments on CoNLL14 and BEA19-Test show that our proposed augmentation method consistently and substantially outperforms strong baselines and achieves state-of-the-art results with only a small amount of synthetic data.

Paper Structure

This paper contains 32 sections, 3 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Illustration of the distribution of error patterns in each dataset. The x-axis represents the 100 most frequent error patterns in the annotated dataset W&I+L, and the y-axis represents the frequency of that error in the corresponding synthetic dataset.
  • Figure 2: Illustration of synthetic data construction based on contextual augmentation. We use both a fine-tuned GPT-2 and in-context learning (ICL) with LLaMA2 in the experiments. In the sampled patterns, red marks the erroneous pattern and green the correct pattern. Note that we combine the sampled correct patterns into a fixed format for context generation, followed by pattern substitution to obtain a parallel corpus. Because of the sampling-based decoding strategy, the generated context may not fully cover the patterns in the input, as in the LLaMA case; in practice, we build the parallel corpus by simply ignoring the unmatched patterns.
  • Figure 3: Illustration of the three phases of joint training with augmented data. We first denoise the synthetic data (Stage II+ & III+) using a baseline model trained in three stages, and subsequently conduct joint training to obtain a robust model.
  • Figure 4: Results of each stage of the three-stage model on BEA19-Dev after contextual augmentation.
  • Figure 5: The effect of different amounts of synthetic data in joint training on the final system.
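The relabeling-based denoising step referenced in Figure 3 can also be sketched briefly. This is a hedged illustration under assumptions: `relabel` stands in for the baseline model obtained from the three-stage training, and the toy model below is only a stub.

```python
# Hypothetical sketch of relabeling-based data cleaning: relabel() is a
# stand-in for a baseline GEC model trained with the three-stage recipe,
# not the authors' implementation.

def denoise(synthetic_pairs, relabel):
    """Replace possibly noisy synthetic targets with baseline predictions.

    Each synthetic (source, target) pair keeps its source sentence, but
    the target is re-generated by the baseline model, reducing the label
    noise introduced during synthetic data construction.
    """
    return [(src, relabel(src)) for src, _noisy_tgt in synthetic_pairs]

# Toy baseline that "corrects" a single agreement error.
def toy_relabel(sentence):
    return sentence.replace("He go ", "He goes ")

pairs = [("He go to school .", "He going to school .")]
cleaned = denoise(pairs, toy_relabel)
```

The cleaned pairs can then be mixed with the annotated data for the joint-training stages shown in Figure 3.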