LM-Combiner: A Contextual Rewriting Model for Chinese Grammatical Error Correction

Yixuan Wang, Baoxin Wang, Yijun Liu, Dayong Wu, Wanxiang Che

TL;DR

This work tackles over-correction in Chinese Grammatical Error Correction (CGEC) by introducing LM-Combiner, a rewriting model that directly refines a single GEC system’s output without model ensembling. It uses a causal language model as the rewriting backbone and a data construction strategy (k-fold cross inference plus gold-label merging) to train the model on in-domain over-corrections. At inference time, the model takes the original sentence and a system’s output and produces a filtered rewrite, yielding a substantial precision gain (+18.2 points) while preserving recall. It performs well even with small models and limited training data. The approach offers a cost-effective, plug-in way to mitigate over-correction in both native and black-box GEC systems, with practical implications for platforms such as search engines and AI chat systems.

Abstract

Over-correction is a critical problem in the Chinese grammatical error correction (CGEC) task. Recent work using voting-based model ensemble methods can effectively mitigate over-correction and improve the precision of a GEC system. However, these methods still require the outputs of several GEC systems and inevitably reduce error recall. In this light, we propose LM-Combiner, a rewriting model that can directly fix the over-corrections in GEC system outputs without a model ensemble. Specifically, we train the model on an over-correction dataset constructed through the proposed K-fold cross inference method, which allows it to directly generate filtered sentences by combining the original and the over-corrected text. In the inference stage, we take the original sentences and the outputs of other systems as input and obtain the filtered sentences through LM-Combiner. Experiments on the FCGEC dataset show that our proposed method effectively alleviates the over-correction of the original system (+18.2 Precision) while keeping error recall unchanged. In addition, we find that LM-Combiner still rewrites well even with few parameters and little training data, and thus can cost-effectively mitigate the over-correction of black-box GEC systems (e.g., ChatGPT).
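The K-fold cross inference idea mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `train_fn` and `toy_train_fn` are hypothetical stand-ins for training a real GEC system, the value of `k` is arbitrary, and the gold-label merging step is omitted because its details are not given here. The point it shows is that every candidate sentence is produced by a model that never saw that sentence during training, so the candidates carry the kind of over-corrections the system would make on unseen text.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]            # (source sentence, gold correction)
Corrector = Callable[[str], str]  # maps a source sentence to a system output


def k_fold_cross_inference(
    pairs: List[Pair],
    train_fn: Callable[[List[Pair]], Corrector],
    k: int = 4,
) -> List[Tuple[str, str, str]]:
    """Return (source, candidate, gold) triples.

    Each candidate is produced by a corrector trained on the other k-1 folds,
    so it reflects the system's behaviour on unseen data.
    """
    folds = [pairs[i::k] for i in range(k)]
    triples = []
    for i, held_out in enumerate(folds):
        # Train on every fold except the held-out one.
        train_split = [p for j, fold in enumerate(folds) if j != i for p in fold]
        corrector = train_fn(train_split)
        # Run inference only on the held-out fold.
        for source, gold in held_out:
            triples.append((source, corrector(source), gold))
    return triples


def toy_train_fn(train_pairs: List[Pair]) -> Corrector:
    """Hypothetical stand-in for a GEC system: it memorises its training
    pairs and 'over-corrects' anything unseen by appending '!'."""
    memory = dict(train_pairs)
    return lambda sentence: memory.get(sentence, sentence + "!")


data = [(f"src{i}", f"gold{i}") for i in range(8)]
triples = k_fold_cross_inference(data, toy_train_fn, k=4)
```

With the toy corrector, every candidate ends in `!`, which confirms that no sentence was corrected by a model that had memorised it. A rewriting model would then be trained to map each (source, candidate) pair to the gold sentence.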

Paper Structure (25 sections, 4 equations, 5 figures, 2 tables, 1 algorithm)

Figures (5)

  • Figure 1: An example of the problem of over-correction, where red represents grammatical errors, blue represents over-correction, and green represents correct changes. LM-Combiner can directly rewrite the system output with reference to the original sentence, filtering out the over-corrections.
  • Figure 2: The flowchart of our error correction-rewriting framework. In the training phase, we construct candidate sentences containing the GEC system's over-corrections by k-fold cross inference and gold label merging (see the data construction section for details). Then, we train the model to generate gold sentences based on the original and candidate sentences (see the LM-Combiner section). In the inference phase, LM-Combiner directly rewrites the system output based on the original sentence.
  • Figure 3: Comparison between the PPL-based approach and our approach. Both methods take the original sentence and the output of the GEC system as input. In the figure, gray squares represent unmodified tokens, green squares represent correctly corrected tokens, and red squares represent over-corrected tokens. Existing work that uses PPL to rerank candidate sentences can improve the precision of the system, but its judgment is not accurate enough because the LM is not trained on in-domain data, which reduces recall. LM-Combiner, trained on constructed candidate sentences, is better able to identify over-corrections and generates results with higher recall end-to-end.
  • Figure 4: The effect of model size for LM-Combiner on the FCGEC validation set. The BART baseline is the system metric without LM-Combiner rewriting. For a more accurate evaluation, we average the results of 5 experiments for each model size, and the floating part of the figure shows the standard deviation of the metrics.
  • Figure 5: The effect of training dataset size for LM-Combiner on the FCGEC validation set. The baseline method represents the metrics for each system without the use of LM-Combiner.
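
To make the precision and recall terminology in the captions above concrete, here is a small worked sketch of why filtering out over-corrections can raise precision while leaving recall unchanged. The edit labels and counts are invented toy values, not results from the paper, and treating edits as exact-match set members is a simplifying assumption rather than the paper's evaluation protocol.

```python
def precision_recall(system_edits, gold_edits):
    """Edit-level precision and recall, treating edits as set members."""
    system_edits, gold_edits = set(system_edits), set(gold_edits)
    true_positives = len(system_edits & gold_edits)
    precision = true_positives / len(system_edits) if system_edits else 1.0
    recall = true_positives / len(gold_edits) if gold_edits else 1.0
    return precision, recall


# Toy data: four gold edits; the system finds two of them and also makes
# two unnecessary changes (over-corrections), labelled x1 and x2.
gold = {"e1", "e2", "e3", "e4"}
system = {"e1", "e2", "x1", "x2"}

# An ideal rewriter drops the over-corrections and keeps the correct edits.
filtered = system - {"x1", "x2"}

before = precision_recall(system, gold)    # (0.5, 0.5)
after = precision_recall(filtered, gold)   # (1.0, 0.5)
```

Removing only the over-corrections shrinks the denominator of precision without touching the true positives, so recall stays where it was. A reranking approach that discards a whole candidate sentence can also drop correct edits, which is how recall gets reduced.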