Chinese Spelling Correction: A Comprehensive Survey of Progress, Challenges, and Opportunities

Changchun Liu, Kai Zhang, Junzhe Jiang, Zixiao Kong, Qi Liu, Enhong Chen

TL;DR

This survey systematically analyzes Chinese Spelling Correction (CSC) from its rule-based and statistical beginnings to modern pre-trained language models (PLMs) and emerging large language models (LLMs), formalizing the task and detailing two architectural families: Information-Learning and Detector-Corrector. It surveys how phonetic (pinyin) and visual (glyph) character information, along with confusion sets, are learned and integrated, and reviews the key datasets and evaluation criteria used for benchmarking. The paper highlights persistent challenges in PLMs (overcorrection, generalization, consecutive errors), in LLMs (length control, overcorrection, phonetic reasoning), and in dataset quality, and proposes future directions that leverage alignment, retrieval-augmented approaches, and cross-domain data to improve CSC performance. Overall, it offers a roadmap for researchers to advance CSC, particularly by exploiting LLM reasoning and domain-adaptive data to achieve robust, scalable corrections across diverse Chinese text domains.

Abstract

Chinese Spelling Correction (CSC) is a critical task in natural language processing, aimed at detecting and correcting spelling errors in Chinese text. This survey provides a comprehensive overview of CSC, tracing its evolution from pre-trained language models to large language models (LLMs) and critically analyzing their respective strengths and weaknesses in this domain. We also present a detailed examination of existing benchmark datasets, highlighting their inherent challenges and limitations. Finally, we propose promising future research directions, focusing in particular on leveraging the potential of LLMs and their reasoning capabilities for improved CSC performance. To the best of our knowledge, this is the first comprehensive survey dedicated to the field of CSC. We believe this work will serve as a valuable resource for researchers, fostering a deeper understanding of the field and inspiring future advancements.

Paper Structure

This paper contains 20 sections, 1 equation, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Two classic architectures of CSC models. On the left, the Information-Learning (I-L) architecture shows the information flow: some models integrate Chinese character information into training and inference, while others incorporate it only into the loss function. On the right, the Detector-Corrector (D-C) architecture shows the information flow of four classic D-C models, making their similarities and differences explicit. The English translation of the modified sentence "事实胜于雄辩" is "Facts speak louder than words".
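
To make the Detector-Corrector idea concrete, here is a minimal toy sketch in Python. It is not taken from the paper or from any of the surveyed models, which use neural detectors and PLM-based correctors. A tiny hand-written lexicon stands in for the language model, the confusion set is hypothetical, and the misspelled input (熊 in place of the homophone 雄) is an assumed example built around the sentence in Figure 1. What it illustrates is the two-stage flow and the fact that the output keeps the same length as the input.

```python
# Toy Detector-Corrector sketch. LEXICON and CONFUSION_SET are hypothetical,
# hand-written stand-ins for a trained detector and a real confusion set.
LEXICON = {"事实", "胜于", "雄辩"}
CONFUSION_SET = {"熊": ["雄", "能"]}  # 熊/雄 share the pinyin "xiong"


def covered_positions(sentence, lexicon):
    """Return the character positions covered by any lexicon word."""
    covered = set()
    for word in lexicon:
        start = sentence.find(word)
        while start != -1:
            covered.update(range(start, start + len(word)))
            start = sentence.find(word, start + 1)
    return covered


def detect(sentence, lexicon=LEXICON):
    """Detector: flag positions that no lexicon word explains."""
    covered = covered_positions(sentence, lexicon)
    return [i for i in range(len(sentence)) if i not in covered]


def correct(sentence, lexicon=LEXICON, confusion_set=CONFUSION_SET):
    """Corrector: at each flagged position, try confusion-set candidates
    and keep the one that leaves the fewest flagged positions."""
    chars = list(sentence)
    for i in detect(sentence, lexicon):
        best = len(detect("".join(chars), lexicon))
        for cand in confusion_set.get(chars[i], []):
            trial = chars[:i] + [cand] + chars[i + 1:]
            score = len(detect("".join(trial), lexicon))
            if score < best:
                chars[i], best = cand, score
    return "".join(chars)


if __name__ == "__main__":
    noisy = "事实胜于熊辩"      # assumed misspelling of "事实胜于雄辩"
    print(detect(noisy))        # flagged positions
    print(correct(noisy))       # corrected sentence, same length as input
```

Because the corrector only substitutes characters in place, the output length always equals the input length, which matches the character-aligned setting that CSC benchmarks typically assume. Restricting candidates to a confusion set is also one simple way to limit overcorrection, since characters with no listed confusions are never changed.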