
Error-Robust Retrieval for Chinese Spelling Check

Xunjian Yin, Xinyu Hu, Jin Jiang, Xiaojun Wan

TL;DR

Chinese Spelling Check (CSC) suffers from scarce annotated data, and existing methods underuse the data that is available. This paper presents RERIC, a plug-and-play retrieval method that augments CSC models with an error-robust datastore: keys fuse phonetic, morphologic, and contextual information, and values are n-grams centered on the target token. Retrieved candidates are reranked by n-gram overlap and then linearly interpolated with the base CSC model's predictions. Results on SIGHAN13/14/15 show substantial gains over strong baselines, especially when RERIC is paired with a multimodal backbone such as REALISE, and ablations show that both the error-robust information (ERI) components and the reranking strategy are necessary. The datastore can be expanded straightforwardly and the inference-time overhead is modest, which makes the approach practical for robust CSC in real-world settings.

Abstract

Chinese Spelling Check (CSC) aims to detect and correct erroneous tokens in Chinese text and has a wide range of applications. However, the task faces two challenges: annotated data is insufficient, and previous methods may not fully leverage the existing datasets. In this paper, we introduce a plug-and-play retrieval method with error-robust information for Chinese Spelling Check (RERIC), which can be directly applied to existing CSC models. The datastore for retrieval is built entirely from the training data and is designed around the characteristics of CSC. Specifically, we employ multimodal representations that fuse phonetic, morphologic, and contextual information when computing the query and key during retrieval, which enhances robustness against potential errors. Furthermore, to better judge the retrieved candidates, the n-gram surrounding the token to be checked is used as the value and for a dedicated reranking step. Experimental results on the SIGHAN benchmarks demonstrate that our proposed method achieves substantial improvements over existing work.
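The datastore and reranking steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the squared-L2 distance, the padding symbol, the overlap score, and the toy 2-dimensional "fused" vectors are all assumptions made for illustration, and taking the n-gram values from the corrected sentences is likewise an assumption. In practice the key vectors would come from the base CSC model.

```python
import numpy as np

PAD = "<pad>"  # hypothetical padding symbol for sentence boundaries


def build_datastore(features, target_sentences, n=3):
    """Build (keys, values) from training data.

    features[i][j] is an (assumed, precomputed) fused vector for token j of
    sentence i; each value is the n-gram centered on that token.
    """
    half = n // 2
    keys, values = [], []
    for feats, sent in zip(features, target_sentences):
        padded = [PAD] * half + list(sent) + [PAD] * half
        for j, vec in enumerate(feats):
            keys.append(vec)
            values.append(tuple(padded[j:j + n]))
    return np.asarray(keys, dtype=float), values


def retrieve(query, keys, values, k):
    """Return the k nearest (n-gram, distance) pairs by squared L2 distance."""
    dists = np.sum((keys - np.asarray(query, dtype=float)) ** 2, axis=1)
    nearest = np.argsort(dists)[:k]
    return [(values[i], float(dists[i])) for i in nearest]


def overlap(value, context):
    """Count matching context positions, ignoring the center (target) slot."""
    center = len(value) // 2
    return sum(
        1 for pos, (a, b) in enumerate(zip(value, context))
        if pos != center and a == b
    )


def rerank(neighbors, context):
    """Sort by n-gram overlap (descending), then by distance (ascending)."""
    return sorted(neighbors, key=lambda nb: (-overlap(nb[0], context), nb[1]))


# Toy data: two training sentences with made-up 2-d feature vectors.
target_sentences = ["一个人", "以后见"]
features = [
    [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]],
    [[0.1, 0.0], [1.0, 1.0], [2.0, 2.0]],
]
keys, values = build_datastore(features, target_sentences)

# Test input "以个人": the first token should be "一". Its query vector is
# (by construction) closest to the key of "以", so retrieval alone is misled.
query = [0.08, 0.0]
context = (PAD, "以", "个")
neighbors = retrieve(query, keys, values, k=2)
reranked = rerank(neighbors, context)
```

In this toy case the nearest neighbor by distance is centered on "以", but the candidate centered on "一" shares more surrounding context with the input and moves to the top after reranking.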

Paper Structure (31 sections, 12 equations, 3 figures, 8 tables)


Figures (3)

  • Figure 1: An illustration of our RERIC method, showing the datastore construction and the inference process with KNN retrieval and reranking. The key contains the phonetic, morphologic, and contextual information of the token obtained from the base CSC model, and the value here is a 3-gram. Both the training and test data contain correct tokens (the majority) and incorrect tokens (marked in red). The target token and its corresponding position in each n-gram value are underlined. The test sample shows the correction process for the token "以 (to)", which should be corrected to "一 (one)".
  • Figure 2: Effect of the number of retrieved neighbors $k$ and the softmax temperature $T$ on the SIGHAN 2015 test set. The performance of the baseline REALISE is represented as a dashed line.
  • Figure 3: Effect of the interpolation parameter $\lambda$ on the SIGHAN 2013, 2014, and 2015 test sets. The performance of REALISE on each test set is shown as a dashed line of the same color.
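
The roles of the hyperparameters studied in Figures 2 and 3 can be made concrete with a small sketch. It assumes a kNN-LM-style formulation, in which neighbor distances are turned into weights by a softmax with temperature $T$ and the resulting distribution is mixed with the base model's distribution using weight $\lambda$ on the retrieval side. The exact equations are not reproduced in this section, so the function names and the direction of the $\lambda$ weighting are assumptions.

```python
import numpy as np


def knn_distribution(neighbor_tokens, neighbor_dists, vocab, T):
    """Softmax over negative distances with temperature T, aggregated per token."""
    d = np.asarray(neighbor_dists, dtype=float)
    logits = -d / T
    weights = np.exp(logits - logits.max())  # subtract max for stability
    weights /= weights.sum()
    p = np.zeros(len(vocab))
    for tok, w in zip(neighbor_tokens, weights):
        p[vocab.index(tok)] += w
    return p


def interpolate(p_knn, p_model, lam):
    """Linear interpolation; lam is the (assumed) weight on the retrieval side."""
    return lam * np.asarray(p_knn) + (1.0 - lam) * np.asarray(p_model)


vocab = ["一", "以"]
p_model = np.array([0.4, 0.6])        # made-up base-model probabilities
tokens = ["一", "一", "以"]            # center tokens of k = 3 neighbors
dists = [1.0, 2.0, 0.5]               # made-up retrieval distances

p_sharp = knn_distribution(tokens, dists, vocab, T=0.1)   # nearest dominates
p_flat = knn_distribution(tokens, dists, vocab, T=100.0)  # close to a vote
p_final = interpolate(p_flat, p_model, lam=0.5)
```

Under these assumptions, a small $T$ concentrates almost all weight on the single nearest neighbor, while a large $T$ approaches a uniform vote over the $k$ neighbors; $\lambda = 0$ recovers the base model and $\lambda = 1$ relies on retrieval alone.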