Chinese Spelling Correction as Rephrasing Language Model

Linfeng Liu, Hongqiu Wu, Hai Zhao

TL;DR

This work reframes Chinese Spelling Correction (CSC) from character-to-character tagging to a semantic rephrasing objective, implemented as the Rephrasing Language Model (ReLM). ReLM encodes the source sentence and then infills a masked target, which avoids overfitting to specific edit pairs and makes better use of pre-trained language representations. It achieves new state-of-the-art results on fine-tuned and zero-shot CSC benchmarks and transfers better in multi-task settings. For real-world deployment, the approach reduces over-editing and supports cross-task use through MLM-style templating and prompting.

Abstract

This paper studies Chinese Spelling Correction (CSC), which aims to detect and correct potential spelling errors in a given sentence. Current state-of-the-art methods treat CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs. However, we note a critical flaw in tagging one character to another: the correction is excessively conditioned on the error. This is the opposite of how humans work, since people rephrase the complete sentence based on its semantics rather than relying solely on previously memorized error patterns. Such a counter-intuitive learning process limits the generalizability and transferability of machine spelling correction. To address this, we propose the Rephrasing Language Model (ReLM), in which the model is trained to rephrase the entire sentence by infilling additional slots instead of tagging character to character. This novel training paradigm achieves new state-of-the-art results across fine-tuned and zero-shot CSC benchmarks, outperforming previous counterparts by a large margin. Our method also learns transferable language representations when CSC is jointly trained with other tasks.
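
The contrast between tagging and rephrasing can be made concrete with a small data-preparation sketch. It is not the authors' implementation. It assumes one token per character, a source and target of equal length, and the placeholder strings `<m>` and `<s>` for the mask and separator characters. The function names and the example sentence are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple

MASK = "<m>"  # placeholder for the mask character (assumed spelling)
SEP = "<s>"   # placeholder for the separator character (assumed spelling)


def build_tagging_example(source: str, target: str) -> Tuple[List[str], List[str]]:
    """Tagging view: each source character is labeled with its target character."""
    if len(source) != len(target):
        raise ValueError("this sketch assumes equal-length source and target")
    return list(source), list(target)


def build_rephrasing_example(
    source: str, target: str
) -> Tuple[List[str], List[Optional[str]]]:
    """Rephrasing view: append a separator and one mask slot per source
    character; only the mask slots carry labels (None means 'not scored')."""
    if len(source) != len(target):
        raise ValueError("this sketch assumes equal-length source and target")
    src = list(source)
    inputs = src + [SEP] + [MASK] * len(src)
    labels: List[Optional[str]] = [None] * (len(src) + 1) + list(target)
    return inputs, labels


if __name__ == "__main__":
    # Made-up sentence pair: the fifth character is a typo for its homophone.
    inputs, labels = build_rephrasing_example("我喜欢吃平果", "我喜欢吃苹果")
    print(inputs)
    print(labels)
```

In the tagging view the label at each position is predicted directly from the (possibly wrong) character at that position. In the rephrasing view the whole source is available as context and the model fills every slot, so the corrected character is predicted from a mask rather than from the erroneous character itself.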

Paper Structure (24 sections, 3 equations, 4 figures, 7 tables)

Figures (4)

  • Figure 1: Comparison of tagging spelling correction and human spelling correction.
  • Figure 2: Paradigm of ReLM in single-task (left) and multi-task (right) settings. The source sentence for CSC is "taking a pair ($\rightarrow$ piece) of painting", and $\left<\rm m\right>$ and $\left<\rm s\right>$ denote the mask and separator characters, respectively. On the right, three representative tasks are depicted (CSC, language inference, and sentiment analysis), and $p$ denotes the prompt for each task.
  • Figure 3: Performance variation (precision and F1) with the proportion of negative and positive samples.
  • Figure 4: Cases selected from ECSpell.