EdaCSC: Two Easy Data Augmentation Methods for Chinese Spelling Correction

Lei Sheng, Shuai-Shuai Xu

TL;DR

This work proposes two data augmentation methods to address spelling errors in Chinese sentences caused by phonetic or visual similarities and demonstrates the superiority of this approach over most existing models, achieving state-of-the-art performance on the SIGHAN15 test set.

Abstract

Chinese Spelling Correction (CSC) aims to detect and correct spelling errors in Chinese sentences caused by phonetic or visual similarities. While current CSC models integrate pinyin or glyph features and have shown significant progress, they still face challenges when dealing with sentences containing multiple typos and are susceptible to overcorrection in real-world scenarios. In contrast to existing model-centric approaches, we propose two data augmentation methods to address these limitations. First, we augment the dataset by either splitting long sentences into shorter ones or reducing typos in sentences with multiple typos. Subsequently, we employ different training processes to select the optimal model. Experimental evaluations on the SIGHAN benchmarks demonstrate the superiority of our approach over most existing models, achieving state-of-the-art performance on the SIGHAN15 test set.

Paper Structure (22 sections, 2 figures, 5 tables)

This paper contains 22 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Different types of training datasets and training procedures. The first part shows the relationship between the datasets: the two data augmentation methods are applied to the original training set TrainData (the merger of SIGHAN and Wang271K) to obtain the TrainShortData and TrainReduceData datasets, respectively, which are then combined to obtain a third dataset, TrainMergeData. The second part shows the different training procedures: in a, b, c, and d, the model is trained on a single dataset; in e, f, and g, it is trained on one dataset first and then on a second dataset.
  • Figure 2: Overview of the EdaCSC data augmentation methods. We augment the SIGHAN and Wang271K training data in two ways. The first method splits long sentences into multiple short sentences, using punctuation marks (",", ".", "!", "?", "...", "......") as segmentation points. The second method reduces the typos in sentences containing multiple typos in turn, thereby generating multiple sentences.
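
The two augmentation methods described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each training example is a character-aligned (source, target) pair of equal length, that full-width Chinese punctuation counts as a segmentation point alongside the marks listed in the caption, that no minimum-length threshold is applied before splitting, and that "reducing typos in turn" means correcting one typo at a time from left to right. The names `split_pair` and `reduce_typos` are hypothetical.

```python
import re
from typing import List, Tuple

# Assumption: a run of any of these marks is one segmentation point.
# The ASCII marks follow the Figure 2 caption; the full-width marks
# and the ellipsis character are added here as an assumption.
SPLIT_RE = re.compile(r"[,.!?,。!?…]+")


def _check_aligned(src: str, tgt: str) -> None:
    # CSC pairs are assumed to differ only by character substitutions.
    if len(src) != len(tgt):
        raise ValueError("source and target must have the same length")


def split_pair(src: str, tgt: str) -> List[Tuple[str, str]]:
    """Split an aligned pair into shorter pairs at punctuation in tgt.

    Each segment keeps its trailing punctuation. The same character
    offsets are applied to src so the segments stay aligned.
    """
    _check_aligned(src, tgt)
    pairs = []
    start = 0
    for match in SPLIT_RE.finditer(tgt):
        end = match.end()
        pairs.append((src[start:end], tgt[start:end]))
        start = end
    if start < len(tgt):
        # Trailing text with no closing punctuation.
        pairs.append((src[start:], tgt[start:]))
    return pairs


def reduce_typos(src: str, tgt: str) -> List[Tuple[str, str]]:
    """Generate extra pairs from a sentence with multiple typos.

    Typos are corrected one at a time, left to right, so a source with
    k typos yields sources with k-1, k-2, ..., 1 typos. Sentences with
    fewer than two typos yield nothing.
    """
    _check_aligned(src, tgt)
    typo_positions = [i for i, (s, t) in enumerate(zip(src, tgt)) if s != t]
    if len(typo_positions) < 2:
        return []
    chars = list(src)
    augmented = []
    # Stop before the last typo so every new source still has an error.
    for pos in typo_positions[:-1]:
        chars[pos] = tgt[pos]
        augmented.append(("".join(chars), tgt))
    return augmented


if __name__ == "__main__":
    source = "我今天很高心,因为天汽很好。"
    target = "我今天很高兴,因为天气很好。"
    for pair in split_pair(source, target):
        print("short:", pair)
    for pair in reduce_typos(source, target):
        print("reduced:", pair)
```

Splitting on the target sentence rather than the source keeps segment boundaries stable even if a typo falls next to a punctuation mark. How the resulting pairs are assembled into TrainShortData, TrainReduceData, and TrainMergeData is not specified in the captions, so that step is left out of the sketch.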