Chinese Grammatical Error Correction: A Survey

Mengyang Qiu, Qingyu Gao, Linxuan Yang, Yang Gu, Tran Minh Nguyen, Zihao Huang, Jungyeul Park

TL;DR

The paper surveys Chinese Grammatical Error Correction (CGEC), covering datasets, annotation schemes, evaluation metrics, and system progress from rule-based to neural approaches. It highlights dataset diversity (CGED, NLPCC, MuCGEC, FCGEC, FlaCGEC, YACLC, CCTC, NaCGEC, CEFE) and linguistic challenges specific to Chinese, such as word segmentation, homophones, and the de particles (的/地/得). It reviews annotation tools (ERRANT, ChERRANT) and a refined six-type typology for native and learner errors, emphasizing cross-dataset standardization and multilingual data fusion. The work argues for standardized evaluation, multi-reference resources, and cross-sentence and document-level approaches to make CGEC more applicable to real-world writing assistance and education.

Abstract

Chinese Grammatical Error Correction (CGEC) is a critical task in Natural Language Processing, addressing the growing demand for automated writing assistance in both second-language (L2) and native (L1) Chinese writing. While L2 learners struggle with mastering complex grammatical structures, L1 users also benefit from CGEC in academic, professional, and formal contexts where writing precision is essential. This survey provides a comprehensive review of CGEC research, covering datasets, annotation schemes, evaluation methodologies, and system advancements. We examine widely used CGEC datasets, highlighting their characteristics, limitations, and the need for improved standardization. We also analyze error annotation frameworks, discussing challenges such as word segmentation ambiguity and the classification of Chinese-specific error types. Furthermore, we review evaluation metrics, focusing on their adaptation from English GEC to Chinese, including character-level scoring and the use of multiple references. In terms of system development, we trace the evolution from rule-based and statistical approaches to neural architectures, including Transformer-based models and the integration of large pre-trained language models. By consolidating existing research and identifying key challenges, this survey provides insights into the current state of CGEC and outlines future directions, including refining annotation standards to address segmentation challenges, and leveraging multilingual approaches to enhance CGEC.
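To make "character-level scoring" and "multiple references" concrete, the following is a minimal sketch, not the survey's or any official scorer's implementation. It assumes edits are extracted with Python's `difflib` rather than with ChERRANT or the M2 scorer, and it takes the best sentence-level F0.5 over the references instead of aggregating counts across a corpus. The function names `char_edits`, `f_beta`, and `best_reference_score` are hypothetical.

```python
from difflib import SequenceMatcher


def char_edits(source, target):
    """Character-level edits as (operation, src_start, src_end, replacement)."""
    # Operating on characters avoids any dependence on word segmentation.
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    return {
        (tag, i1, i2, target[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    }


def f_beta(tp, fp, fn, beta=0.5):
    """F-beta from edit counts; beta=0.5 weights precision above recall."""
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_reference_score(source, hypothesis, references, beta=0.5):
    """Score the hypothesis against each reference and keep the best match."""
    hyp = char_edits(source, hypothesis)
    best = 0.0
    for ref in references:
        gold = char_edits(source, ref)
        tp = len(hyp & gold)  # edits proposed by the system and present in the reference
        best = max(best, f_beta(tp, len(hyp) - tp, len(gold) - tp, beta))
    return best


# Illustrative sentence with a de-particle error (的 should be 得).
source = "他跑的很快"
references = ["他跑得很快", "他跑得非常快"]
score = best_reference_score(source, "他跑得很快", references)
```

With a single reference, a valid correction that differs from that reference would be penalized; scoring against several references and keeping the best match is one simple way to reduce that effect.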

Paper Structure

This paper contains 45 sections, 1 equation, 14 figures, 8 tables, 1 algorithm.

Figures (14)

  • Figure 1: File format and sentence examples: CGED 2014
  • Figure 2: File format and sentence examples: CGED 2015
  • Figure 3: File format and sentence examples: CGEC2016
  • Figure 4: File format and sentence examples: CGEC2020-2021
  • Figure 5: File format and sentence examples: FCGEC
  • ...and 9 more figures