Context-aware Stand-alone Neural Spelling Correction
Xiangci Li, Hairong Liu, Liang Huang
TL;DR
This work tackles stand-alone spelling correction: correcting individual tokens without changing the token count, using both orthographic cues and global context. It introduces two Transformer-based encoders: a Word+Char model that combines global context with per-word spelling, and a Subword model that uses subword representations and enables LM pre-training. Training with synthetic character-level noise adds robustness. In experiments on a dataset derived from the 1B-Word Benchmark, the approach improves $F_{0.5}$ by 12.8 percentage points over the prior state of the art. The results indicate that effective stand-alone spelling correction benefits from (i) combining spelling and context information, (ii) leveraging pre-trained language models, and (iii) training with diverse, synthetic misspellings to improve robustness to unseen errors.
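To make the idea of synthetic character-level noise concrete, here is a minimal Python sketch. It assumes four common edit operations (insertion, deletion, substitution, adjacent transposition) and a lowercase ASCII alphabet; the summary above does not specify the actual noise operations or rates, so `add_char_noise` and `corrupt_sentence` are hypothetical helpers, not the authors' procedure.

```python
import random
import string


def add_char_noise(word, rng, alphabet=string.ascii_lowercase):
    """Apply one random character-level edit to `word` (hypothetical helper).

    Words shorter than two characters are returned unchanged so that a
    deletion can never produce an empty token.
    """
    if len(word) < 2:
        return word
    op = rng.choice(["insert", "delete", "substitute", "transpose"])
    i = rng.randrange(len(word))
    if op == "insert":
        return word[:i] + rng.choice(alphabet) + word[i:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "substitute":
        # May pick the same character, leaving the word unchanged.
        return word[:i] + rng.choice(alphabet) + word[i + 1:]
    # Transpose two adjacent characters; clamp so index j + 1 is valid.
    j = min(i, len(word) - 2)
    return word[:j] + word[j + 1] + word[j] + word[j + 2:]


def corrupt_sentence(tokens, rate, rng):
    """Corrupt each token independently with probability `rate`.

    The token count is preserved, matching the stand-alone setting in
    which tokens are neither inserted nor deleted.
    """
    return [add_char_noise(t, rng) if rng.random() < rate else t
            for t in tokens]


if __name__ == "__main__":
    rng = random.Random(0)
    clean = "humans can easily infer the correct words".split()
    print(corrupt_sentence(clean, rate=0.5, rng=rng))
```

Each clean token stays aligned with its noisy counterpart, so pairs of (noisy, clean) sentences produced this way could serve directly as per-token training examples.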
Abstract
Existing natural language processing systems are vulnerable to noisy inputs resulting from misspellings. In contrast, humans can easily infer the correct words from their misspellings and the surrounding context. Inspired by this, we address the stand-alone spelling correction problem, which corrects only the spelling of each token without inserting or deleting tokens, by utilizing both spelling information and global context representations. We present a simple yet powerful solution that jointly detects and corrects misspellings as a sequence labeling task by fine-tuning a pre-trained language model. Our solution outperforms the previous state-of-the-art result by 12.8% absolute $F_{0.5}$ score.
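For reference, the reported metric follows the standard $F_\beta$ definition, written here with $P$ for precision and $R$ for recall. These symbols are introduced only for illustration, and the abstract does not state how precision and recall are computed over tokens.

$$
F_\beta = \frac{(1 + \beta^2)\, P R}{\beta^2 P + R},
\qquad
F_{0.5} = \frac{1.25\, P R}{0.25\, P + R}
$$

With $\beta = 0.5$, precision is weighted more heavily than recall, so a system that introduces wrong corrections is penalized more than one that misses some misspellings.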
