Efficient Domain Adaptation for Text Line Recognition via Decoupled Language Models

Arundhathi Dev, Justin Zhan

Abstract

Optical character recognition remains critical infrastructure for document digitization, yet state-of-the-art performance is often restricted to well-resourced institutions by prohibitive computational barriers. End-to-end transformer architectures achieve strong accuracy but demand hundreds of GPU hours for domain adaptation, limiting accessibility for practitioners and digital humanities scholars. We present a modular detection-and-correction framework that achieves near-SOTA accuracy with single-GPU training. Our approach decouples lightweight visual character detection (domain-agnostic) from domain-specific linguistic correction using pretrained sequence models including T5, ByT5, and BART. By training the correctors entirely on synthetic noise, we enable annotation-free domain adaptation without requiring labeled target images. Evaluating across modern clean handwriting, cursive script, and historical documents, we identify a critical "Pareto frontier" in architecture selection: T5-Base excels on modern text with standard vocabulary, whereas ByT5-Base dominates on historical documents by reconstructing archaic spellings at the byte level. Our results demonstrate that this decoupled paradigm matches end-to-end transformer accuracy while reducing compute by approximately 95%, establishing a viable, resource-efficient alternative to monolithic OCR architectures.
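To make the annotation-free correction recipe concrete, the sketch below fine-tunes a pretrained seq2seq corrector on synthetically corrupted text, so only clean target-domain text is required. The character-level noise model, the google/byt5-small checkpoint, and all hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of annotation-free corrector training (assumed setup, not
# the paper's exact recipe): corrupt clean target-domain text with synthetic
# OCR-style noise, then fine-tune a pretrained seq2seq model to invert it.
import random

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def add_ocr_noise(text: str, p: float = 0.1) -> str:
    """Apply character-level deletions, substitutions, and insertions."""
    out = []
    for ch in text:
        r = random.random()
        if r < p / 3:
            continue                             # deletion
        if r < 2 * p / 3:
            out.append(random.choice(ALPHABET))  # substitution
        else:
            out.append(ch)                       # keep character
        if random.random() < p / 3:
            out.append(random.choice(ALPHABET))  # insertion
    return "".join(out)

# ByT5 operates on raw bytes, matching the paper's historical-document setting.
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/byt5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

corpus = ["the quick brown fox jumps over the lazy dog"]  # clean domain text
model.train()
for clean in corpus:
    batch = tokenizer(add_ocr_noise(clean), return_tensors="pt")
    labels = tokenizer(clean, return_tensors="pt").input_ids
    loss = model(**batch, labels=labels).loss    # standard seq2seq cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

At inference, the same corrector would instead receive the visual detector's raw transcription and repair its residual recognition errors.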

Figures (2)

  • Figure 1: The Domain Gap. We evaluate across a difficulty spectrum from (a) modern clean handwriting through (b) cursive script to (c) historical documents. While token-based models (T5) excel on (a), they struggle with the archaic vocabulary and noise in (c), necessitating the byte-level reconstruction of ByT5.
  • Figure 2: Decoupled detection-and-correction architecture. A detection-based visual module localizes and classifies characters in parallel and is pretrained on large-scale synthetic data, followed by lightweight domain adaptation using weak supervision (CTC loss). A pretrained language model corrector is fine-tuned to repair residual recognition errors and capture domain-specific linguistic patterns. Decoupling visual detection from linguistic correction enables efficient adaptation across writing styles and document domains.
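As a concrete reading of Figure 2's weak-supervision step, the sketch below computes a CTC loss over a detector's per-timestep character distributions using only a line-level transcript, so no character bounding boxes are needed. The tensor shapes, vocabulary size, and the random tensor standing in for detector outputs are illustrative assumptions.

```python
# Minimal sketch of CTC-based weak supervision (Figure 2), assuming the
# visual detector emits per-timestep log-probabilities over characters.
# CTC marginalizes over all monotonic alignments, so only the unaligned
# line transcript is required for domain adaptation.
import torch
import torch.nn as nn

T, N, C = 50, 1, 80   # timesteps, batch size, character classes (blank = 0)
S = 12                # transcript length

# Stand-in for detector outputs; a real model's activations go here.
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(2)
transcript = torch.randint(1, C, (N, S))      # line-level labels, no alignment
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, transcript, input_lengths, target_lengths)
loss.backward()       # gradients would flow back into the visual detector
```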