Cleansing Jewel: A Neural Spelling Correction Model Built On Google OCR-ed Tibetan Manuscripts
Queenie Luo, Yung-Sung Chuang
TL;DR
The paper tackles OCR-induced spelling errors in Tibetan manuscript text by introducing a Transformer-based post-processing model augmented with OCR Confidence Score embeddings. The approach trains on paired noisy OCR outputs and human-corrected references, reducing an initial error rate of $25\%$ to a Character Error Rate (CER) of $0.1226$ and outperforming Transformer, LSTM, and GRU baselines. Key contributions include a data pipeline that retains per-token confidence scores, the Confidence Score embedding mechanism, and ablations over BPE vocabulary size and confidence-vocabulary size, supported by attention-heatmap and erroneous-token analyses. The work offers a practical, language-agnostic framework for OCR post-processing that could be extended with larger vocabularies and Tibetan-language contextual models (e.g., Tibetan BERT) to improve semantic corrections and cross-language adaptability.
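Since the headline result is reported as a CER, a minimal sketch of how such a score is commonly computed may help: the total edit distance between model outputs and references, divided by the total reference length. This is the standard definition and not necessarily the authors' exact evaluation script. The function names `levenshtein` and `cer` are illustrative, and the sketch assumes a "character" is a Unicode code point, which for Tibetan means a stacked syllable counts as several characters.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(hypotheses: Sequence[str], references: Sequence[str]) -> float:
    """Corpus-level CER: total edits divided by total reference characters.

    Assumption: characters are Unicode code points, not grapheme clusters.
    """
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must be the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(levenshtein(h, r) for h, r in zip(hypotheses, references))
    return total_edits / total_chars
```

Under this definition, a CER of $0.1226$ corresponds to roughly 12 character edits per 100 reference characters.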
Abstract
Scholars in the humanities rely heavily on ancient manuscripts to study history, religion, and socio-political structures in the past. Many efforts have been devoted to digitizing these precious manuscripts using Optical Character Recognition (OCR) technology, but most manuscripts have been blemished over the centuries, so an OCR program cannot be expected to capture faded graphs and stains on pages accurately. This work presents a neural spelling correction model built on Google OCR-ed Tibetan manuscripts to auto-correct noisy OCR output. This paper is divided into four sections: dataset, model architecture, training, and analysis. First, we feature-engineered our raw Tibetan etext corpus into two sets of structured data frames: a set of paired toy data and a set of paired real data. Then, we incorporated a Confidence Score mechanism into the Transformer architecture to perform spelling correction. Measured by loss and Character Error Rate (CER), our Transformer + Confidence Score architecture outperforms the Transformer, LSTM-2-LSTM, and GRU-2-GRU architectures. Finally, to examine the robustness of our model, we analyzed erroneous tokens and visualized the Attention and Self-Attention heatmaps of our model.
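One plausible reading of the Confidence Score mechanism is sketched below. It is an assumption, not the authors' confirmed implementation: each per-token OCR confidence in $[0, 1]$ is discretized into one of a fixed number of buckets (a "confidence vocabulary"), each bucket has its own embedding vector, and that vector is added to the token embedding before the Transformer encoder. The names `bucketize` and `embed_with_confidence`, the equal-width bucketing, and the additive combination are all hypothetical choices made for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def bucketize(confidences: Sequence[float], num_buckets: int) -> List[int]:
    """Map confidences in [0, 1] to bucket ids in [0, num_buckets - 1].

    Assumption: equal-width buckets; values outside [0, 1] are clipped.
    """
    ids = []
    for c in confidences:
        c = min(max(c, 0.0), 1.0)
        ids.append(min(int(c * num_buckets), num_buckets - 1))
    return ids


def embed_with_confidence(
    token_ids: Sequence[int],
    confidences: Sequence[float],
    token_table: Sequence[Vector],  # shape: (token vocab size, d_model)
    conf_table: Sequence[Vector],   # shape: (confidence vocab size, d_model)
) -> List[Vector]:
    """Return token embedding + confidence-bucket embedding per position."""
    if len(token_ids) != len(confidences):
        raise ValueError("need exactly one confidence score per token")
    bucket_ids = bucketize(confidences, len(conf_table))
    return [
        [t + c for t, c in zip(token_table[tid], conf_table[bid])]
        for tid, bid in zip(token_ids, bucket_ids)
    ]
```

In a trained model both tables would be learned parameters, and the number of rows in `conf_table` would play the role of the confidence-vocabulary size varied in the ablations.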
