Refining Corpora from a Model Calibration Perspective for Chinese Spelling Correction

Dingyao Yu, Yang An, Wei Ye, Xiongfeng Xiao, Shaoguang Mao, Tao Ge, Shikun Zhang

TL;DR

The paper addresses the scarcity of high-quality CSC corpora and the noise introduced by common augmentation methods. It reveals a calibration-generalization trade-off between OCR/ASR-based and random-replacement data, and introduces a corpus-refining pipeline that uses a well-calibrated model trained on random replacements to filter OCR/ASR samples before training a final CSC model. A Bayesian theoretical analysis explains how sample type affects model confidence and motivates upper-bounded confidence-based filtering, which yields competitive results and reduced over-correction on SIGHAN13/14/15. Practically, the method offers a simple, data-efficient route to robust and well-calibrated CSC systems in real-world settings, with demonstrated improvements in calibration metrics and false-positive rates.

Abstract

Chinese Spelling Correction (CSC) commonly lacks large-scale, high-quality corpora because labeling spelling errors in real-life human writing or typing scenarios is labor-intensive. Two data augmentation methods are widely adopted: (1) Random Replacement guided by confusion sets and (2) OCR/ASR-based Generation, which simulates character misuse. However, both methods inevitably introduce noisy data (e.g., false spelling errors), potentially leading to over-correction. By carefully analyzing the two types of corpora, we find that although the latter achieves more robust generalization performance, the former yields better-calibrated CSC models. We then provide a theoretical analysis of this empirical observation, based on which we propose a corpus refining strategy. Specifically, OCR/ASR-based data samples are fed into a well-calibrated CSC model trained on random replacement-based corpora and then filtered based on prediction confidence. By training a simple BERT-based model on the refined OCR/ASR-based corpus, we achieve state-of-the-art performance on three widely used benchmarks while significantly alleviating over-correction (e.g., lowering false positive predictions).
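
The refining step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only says that samples are "filtered based on prediction confidence", so the concrete criterion here is an assumption. The sketch drops a sample when the calibrated model assigns a probability above a threshold p to the source character at a position annotated as an error, i.e. when it is confident that the "error" is not one. The names Sample, ProbFn, is_noisy, refine, and toy_prob_fn are hypothetical, and the toy probability table stands in for a real CSC model trained on random-replacement data.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Sample:
    source: str  # sentence containing OCR/ASR-style injected errors
    target: str  # reference sentence, same length as source


# Hypothetical interface: probability that the calibrated model assigns
# to character `char` at position `pos` of sentence `source`.
ProbFn = Callable[[str, int, str], float]


def is_noisy(sample: Sample, prob_fn: ProbFn, p: float) -> bool:
    """Assumed criterion: flag the sample if, at any annotated error
    position, the model keeps the source character with confidence > p."""
    for pos, (src_char, tgt_char) in enumerate(zip(sample.source, sample.target)):
        if src_char != tgt_char and prob_fn(sample.source, pos, src_char) > p:
            return True
    return False


def refine(corpus: Sequence[Sample], prob_fn: ProbFn, p: float) -> List[Sample]:
    """Keep only the samples that the calibrated model does not flag."""
    return [s for s in corpus if not is_noisy(s, prob_fn, p)]


# Toy stand-in for a calibrated model: a lookup table of made-up confidences.
_TOY_PROBS = {
    ("他是学生", 0, "他"): 0.90,    # "he" vs. "she": context cannot decide
    ("我喜欢平果", 3, "平"): 0.05,  # a genuine misspelling of 苹
}


def toy_prob_fn(source: str, pos: int, char: str) -> float:
    return _TOY_PROBS.get((source, pos, char), 0.0)


corpus = [
    Sample(source="他是学生", target="她是学生"),
    Sample(source="我喜欢平果", target="我喜欢苹果"),
]
refined = refine(corpus, toy_prob_fn, p=0.5)
print([s.source for s in refined])
```

With the toy confidences above, the ambiguous he/she sample is removed and the genuine misspelling is kept. In practice p would be tuned, since a lower threshold removes more samples.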

Paper Structure

This paper contains 32 sections, 14 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Calibration curves and performance of BERT-based CSC models trained on random replacement and OCR/ASR-based data. ECE denotes the Expected Calibration Error (Guo et al., 2017), and FPR denotes the sentence-level false positive rate, which measures over-correction. Combining subplots (a), (b), and (c), OCR/ASR-based data yield better performance on standard metrics (e.g., P, R, and F1), while random replacement yields better calibration and FPR. These observations inspire us to denoise OCR/ASR-based data with well-calibrated CSC models trained on random replacement data, to improve performance and mitigate over-correction.
  • Figure 2: Conceptual illustration of sample confidence and the filtering process for noisy samples. The upper part demonstrates the variability of model confidence across different samples. The bottom part illustrates the utilization of confidence to identify and filter out noisy samples. The dotted line represents a scalar, while the plane serves as a visual aid for better comprehension.
  • Figure 3: Case study of noisy and multi-answer samples. For the noisy sample, the given context does not tell us whether "he" or "she" should be written, so we generally do not consider it a spelling error. For the multi-answer sample, the original sentence and the alternative are both contextually reasonable, and "要" and "收" are both in the confusion set of the character "咬" based on phonology or morphology.
  • Figure 4: The filtering ratio of noisy samples and multi-answer samples under our method and the self-filtering method.
  • Figure 5: F1 and FPR of the method on three datasets with different filtering thresholds $p$.
  • ...and 1 more figure
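
The two diagnostics named in the Figure 1 caption can be sketched as follows. ECE follows the standard binned definition of Guo et al. (2017): predictions are grouped into equal-width confidence bins, and the absolute gap between accuracy and mean confidence in each bin is averaged, weighted by bin size. The sentence-level FPR shown here assumes the usual CSC convention, namely the fraction of error-free input sentences that the model nevertheless changes; the source does not spell out its exact definition. The function names and the choice of 10 bins are illustrative.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: sum over bins of (|B_m| / n) * |acc(B_m) - conf(B_m)|."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins
    for c, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes to the last bin.
        m = min(int(c * n_bins), n_bins - 1)
        conf_sum[m] += c
        hit_sum[m] += 1.0 if ok else 0.0
        count[m] += 1
    ece = 0.0
    for m in range(n_bins):
        if count[m] == 0:
            continue
        gap = abs(hit_sum[m] / count[m] - conf_sum[m] / count[m])
        ece += (count[m] / n) * gap
    return ece


def sentence_level_fpr(
    sources: Sequence[str],
    targets: Sequence[str],
    predictions: Sequence[str],
) -> float:
    """Assumed definition: among sentences with no error (source == target),
    the fraction whose prediction differs from the source."""
    clean = [(s, p) for s, t, p in zip(sources, targets, predictions) if s == t]
    if not clean:
        return 0.0
    return sum(1 for s, p in clean if p != s) / len(clean)


# Four predictions at confidence 0.9, three of them correct:
# accuracy 0.75 vs. confidence 0.9 gives a gap of 0.15.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])

# Two error-free inputs ("ab", "cd"), one of which the model alters.
fpr = sentence_level_fpr(["ab", "cd", "ef"], ["ab", "cd", "eg"], ["ab", "cx", "eg"])
print(round(ece, 4), fpr)
```

A lower ECE means that the model's confidence tracks its accuracy more closely, which is the property that confidence-based filtering relies on.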