RE$^2$: Improving Chinese Grammatical Error Correction via Retrieving Appropriate Examples with Explanation
Baoxin Wang, Yumeng Luo, Yixuan Wang, Dayong Wu, Wanxiang Che, Shijin Wang
TL;DR
This paper addresses Chinese Grammatical Error Correction (CGEC), focusing on how to select effective reference examples when prompting large language models. It introduces RE$^2$, a retrieval method that uses grammatical error explanations (GEE) rather than surface text similarity to find relevant reference examples, and applies them through both In-Context Learning (ICL) and Supervised Fine-Tuning (SFT). A high-quality grammatical error explanation dataset (FCGEE) is constructed to support explanation generation and retrieval. The approach yields state-of-the-art results on the native-speaker CGEC datasets FCGEC and NaCGEC, particularly when paired with Qwen2-7B-Instruct or GPT-4o. The work also analyzes explanation quality, retrieval strategies, and error-type-specific gains. It acknowledges limitations in inference speed and in applicability to non-patterned spelling errors, and it proposes directions for further improving GEC via richer explanations.
Abstract
The primary objective of Chinese grammatical error correction (CGEC) is to detect and correct errors in Chinese sentences. Recent research has applied large language models (LLMs) to CGEC and achieved significant results. For LLMs, selecting appropriate reference examples can help improve performance. However, existing methods predominantly rely on text similarity for example retrieval, a strategy that frequently mismatches actual error patterns and retrieves lexically similar yet grammatically irrelevant sentences. To address this problem, we propose a method named RE$^2$, which retrieves appropriate examples with explanations of grammatical errors. Instead of relying on the text similarity of the input sentence, we use explanations of grammatical errors to select reference examples, which LLMs then use to improve CGEC performance. We conduct experiments on two CGEC datasets and create a high-quality grammatical error explanation (GEE) dataset, which is not only used in our research but also serves as a valuable resource for future studies in both CGEC and GEE. The experimental results on the two datasets indicate that our proposed method effectively improves the performance of CGEC.
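
The retrieval idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, the encoder, or the prompt layout, so the bag-of-words cosine similarity, the function names (`retrieve_by_explanation`, `build_prompt`), the `POOL` entries, and the prompt format are all hypothetical stand-ins. The sketch also assumes that an explanation for the input sentence is already available, and it uses English placeholder text where a real system would handle Chinese.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Toy encoder: lowercase word counts (assumed stand-in for a real embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_by_explanation(query_explanation, pool, k=2):
    """Rank pool examples by similarity of their error explanations, not their sentences."""
    q = bag_of_words(query_explanation)
    ranked = sorted(pool, key=lambda ex: cosine(q, bag_of_words(ex["explanation"])), reverse=True)
    return ranked[:k]


def build_prompt(sentence, examples):
    """Assemble a few-shot prompt from retrieved examples (hypothetical layout)."""
    parts = [
        f"Input: {ex['source']}\nExplanation: {ex['explanation']}\nOutput: {ex['target']}\n"
        for ex in examples
    ]
    parts.append(f"Input: {sentence}\nOutput:")
    return "\n".join(parts)


# Hypothetical example pool; sentences are placeholders, not data from the paper.
POOL = [
    {"id": "a", "source": "<erroneous sentence A>", "target": "<corrected sentence A>",
     "explanation": "The sentence is missing a subject."},
    {"id": "b", "source": "<erroneous sentence B>", "target": "<corrected sentence B>",
     "explanation": "Two words are in the wrong order."},
    {"id": "c", "source": "<erroneous sentence C>", "target": "<corrected sentence C>",
     "explanation": "A redundant word repeats the meaning."},
]
```

Calling `retrieve_by_explanation("The subject is missing.", POOL, k=2)` ranks example `a` first because its explanation describes the same error type, regardless of how lexically similar the underlying sentences are. The retrieved examples can then be passed to `build_prompt` for in-context learning.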
