Cleansing Jewel: A Neural Spelling Correction Model Built On Google OCR-ed Tibetan Manuscripts

Queenie Luo, Yung-Sung Chuang

TL;DR

The paper tackles OCR-induced spelling errors in Tibetan manuscript text by introducing a Transformer-based post-processing model augmented with OCR Confidence Score embeddings. The model is trained on paired noisy OCR outputs and human-corrected references; it reduces an initial error rate of $25\%$ to a Character Error Rate (CER) of $0.1226$ and outperforms plain Transformer, LSTM, and GRU baselines. Key contributions include the data pipeline with per-token confidence scores, the Confidence Score embedding mechanism, and ablations on BPE vocabulary size and confidence-vocabulary size, supported by attention-heatmap and erroneous-token analyses. The work provides a practical, language-agnostic framework for OCR post-processing that can be extended with larger vocabularies and Tibetan-language contextual models (e.g., Tibetan BERT) to further improve semantic corrections and cross-language adaptability.
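
For readers unfamiliar with the metric, the following is a minimal sketch of how a Character Error Rate is commonly computed: the Levenshtein edit distance between the model output and the reference, divided by the reference length. The function names `levenshtein` and `cer` are illustrative, and this summary does not state the paper's exact CER definition, so treat the normalization by reference length as an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character Error Rate, assumed here to be edit distance / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(hypothesis, reference) / len(reference)


# One substitution in a four-character reference.
print(cer("kife", "kite"))
```

Python strings iterate over Unicode code points, so for Tibetan text this sketch counts stacked letters and vowel signs as separate characters; a different segmentation would give different values.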

Abstract

Scholars in the humanities rely heavily on ancient manuscripts to study history, religion, and socio-political structures of the past. Many efforts have been devoted to digitizing these precious manuscripts using Optical Character Recognition (OCR) technology, but most manuscripts were blemished over the centuries, so an OCR program cannot be expected to capture faded graphs and stains on pages. This work presents a neural spelling correction model built on Google OCR-ed Tibetan manuscripts to auto-correct noisy OCR output. The paper is divided into four sections: dataset, model architecture, training, and analysis. First, we feature-engineered our raw Tibetan etext corpus into two sets of structured data frames: a set of paired toy data and a set of paired real data. Then, we implemented a Confidence Score mechanism in the Transformer architecture to perform spelling correction tasks. According to the Loss and Character Error Rate, our Transformer + Confidence Score mechanism architecture proves to be superior to the Transformer, LSTM-2-LSTM, and GRU-2-GRU architectures. Finally, to examine the robustness of our model, we analyzed erroneous tokens and visualized the Attention and Self-Attention heatmaps in our model.
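
The Confidence Score mechanism can be pictured with a small NumPy sketch: each token's OCR confidence is mapped to a discrete bucket, the bucket selects a row in a second embedding table, and that row is combined with the ordinary token embedding. Several details here are assumptions rather than the authors' implementation: confidences are taken to lie in $[0, 1]$, the buckets are uniform, the two embeddings are summed (concatenation would be another reading of "augmenting"), and the names `bucketize_confidence` and `embed_with_confidence` as well as the table sizes are hypothetical.

```python
import numpy as np


def bucketize_confidence(conf, num_buckets: int) -> np.ndarray:
    """Map confidences in [0, 1] to integer bucket ids in [0, num_buckets - 1]."""
    conf = np.clip(np.asarray(conf, dtype=np.float64), 0.0, 1.0)
    ids = np.floor(conf * num_buckets).astype(np.int64)
    # A confidence of exactly 1.0 would land in bucket num_buckets; fold it back.
    return np.minimum(ids, num_buckets - 1)


def embed_with_confidence(token_ids, conf, token_table, conf_table) -> np.ndarray:
    """Sum token embeddings and confidence-bucket embeddings (assumed: addition)."""
    conf_ids = bucketize_confidence(conf, conf_table.shape[0])
    return token_table[np.asarray(token_ids)] + conf_table[conf_ids]


# Randomly initialized tables stand in for learned parameters.
rng = np.random.default_rng(0)
token_table = rng.normal(size=(50, 8))  # hypothetical: 50 BPE tokens, width 8
conf_table = rng.normal(size=(10, 8))   # hypothetical: 10 confidence buckets

token_ids = [3, 17, 42]
conf = [0.98, 0.35, 1.0]
x = embed_with_confidence(token_ids, conf, token_table, conf_table)
print(x.shape)
```

In a real model both tables would be trained jointly with the rest of the Transformer, and the number of buckets corresponds to the confidence-vocabulary size that the TL;DR says the paper ablates.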

Paper Structure (14 sections, 3 equations, 5 figures, 4 tables)

This paper contains 14 sections, 3 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Intended Usage: Unlike the common End-to-End method, which processes original raw images into texts, our Transformer + Confidence Score mechanism is designed for post-processing spelling errors in Google's OCR outputs.
  • Figure 2: Model Architecture. The Confidence Score mechanism is incorporated into the Transformer architecture, designed to integrate OCR system confidence scores. This is achieved by augmenting the standard input embeddings with additional confidence score embeddings.
  • Figure 3: The figure on the left shows tokens that the model generally succeeds in correcting. These tokens, such as "y", "7", and "@", are so distinct from Tibetan letters that the model can easily identify and edit them. The figure on the right shows tokens that the model generally fails to correct. Tokens such as "", "", and "" are among the most common Tibetan letters, which can combine with numerous other Tibetan syllables to form meaningful words.
  • Figure 4: Attention Heatmaps. The attention heatmaps are generated using the third attention layer, averaged across four heads in both encoder and decoder. Dark squares indicate low attention weights and correlations between tokens, while bright squares indicate high attention and correlations. Subfigures (a), (b), and (c) show the Source-Attention in the decoder, and Self-Attention in the encoder and the decoder, respectively.
  • Figure 5: An example of the model's short-sightedness, where "kife" is miscorrected to "knife" instead of "kite" because the model focuses on the nearest 2-3 neighbors. Note: This example is presented in English for readability, but a similar issue occurs in the model’s Tibetan output.
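
The head averaging described in the Figure 4 caption can be sketched as follows. The array layout `(layers, heads, target_len, source_len)`, the function name `head_averaged_heatmap`, and the random weights are assumptions for illustration only; the actual weights would come from the trained model, which this summary does not expose.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def head_averaged_heatmap(attn: np.ndarray, layer: int) -> np.ndarray:
    """Select one layer and average its attention weights over the head axis.

    attn is assumed to have shape (layers, heads, target_len, source_len).
    """
    return attn[layer].mean(axis=0)


# Random stand-in for attention weights: 3 layers, 4 heads, 5 target and 6 source tokens.
rng = np.random.default_rng(0)
attn = softmax(rng.normal(size=(3, 4, 5, 6)))

# The third layer has index 2; the result is one (target_len, source_len) matrix.
heatmap = head_averaged_heatmap(attn, layer=2)
print(heatmap.shape)
```

Because every head's rows sum to one, the averaged rows also sum to one, so the matrix can be passed directly to a plotting call such as Matplotlib's `imshow` to get a heatmap where brighter cells mean higher attention.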