Practical Fine-Tuning of Autoregressive Models on Limited Handwritten Texts

Jan Kohút, Michal Hradiš

TL;DR

This work tackles fine-tuning autoregressive transformer OCR models under limited handwritten data, addressing practical needs for early, stable adaptation during correction-based transcription. It systematically analyzes how different transformer components contribute to adaptation, evaluates multiple stopping criteria, and explores active learning to reduce annotation effort. Key findings show that full-model fine-tuning is most reliable, the encoder is often sufficient for known styles while the decoder becomes important for unseen languages or transcription conventions, and a simple stopping rule (TRN_CER) performs robustly with limited data; active learning further halves annotation requirements while preserving gains. The results provide actionable guidance for real-world OCR deployment, enabling cost-efficient adaptation across diverse writers and historical scripts.

Abstract

A common use case for OCR applications involves users uploading documents and progressively correcting automatic recognition to obtain the final transcript. This correction phase presents an opportunity for progressive adaptation of the OCR model, making it crucial to adapt early, while ensuring stability and reliability. We demonstrate that state-of-the-art transformer-based models can effectively support this adaptation, gradually reducing the annotator's workload. Our results show that fine-tuning can reliably start with just 16 lines, yielding a 10% relative improvement in CER, and scale up to 40% with 256 lines. We further investigate the impact of model components, clarifying the roles of the encoder and decoder in the fine-tuning process. To guide adaptation, we propose reliable stopping criteria, considering both direct approaches and global trend analysis. Additionally, we show that OCR models can be leveraged to cut annotation costs by half through confidence-based selection of informative lines, achieving the same performance with fewer annotations.
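The relative CER improvements quoted above can be made concrete with a small calculation. The following is a minimal sketch, assuming CER is defined in the usual way as character-level edit distance divided by the number of reference characters; the function names are illustrative and are not taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference character
                curr[j - 1] + 1,         # insertion of a hypothesis character
                prev[j - 1] + (r != h),  # substitution (free if characters match)
            ))
        prev = curr
    return prev[-1]


def cer(refs: list[str], hyps: list[str]) -> float:
    """Character error rate over a set of lines (assumed standard definition)."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return errors / chars


def relative_improvement(cer_before: float, cer_after: float) -> float:
    """Fraction by which CER dropped after fine-tuning."""
    return (cer_before - cer_after) / cer_before
```

Under this definition, a model whose CER falls from 0.10 to 0.09 after fine-tuning has achieved a 10% relative improvement, and a drop from 0.10 to 0.06 corresponds to 40%.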

Paper Structure

This paper contains 13 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: One sample word from each writer in our target dataset. Original visualization from kohut2023fine.
  • Figure 2: Fine-tuning with $\mathrm{TRN_{CER}}$; colors correspond to writers as in Figure 1.
  • Figure 3: Fine-tuning components of BASE baseline for native transcription styles.
  • Figure 4: Fine-tuning components of BASE baseline for unseen transcription styles.
  • Figure 5: Comparison of stopping criteria, results normalized with $\mathrm{TST_{CER}}$.
  • ...and 3 more figures