Evaluating LLMs for Historical Document OCR: A Methodological Framework for Digital Humanities

Maria Levchenko

TL;DR

This work addresses the challenge of evaluating Large Language Model (LLM)-based OCR on historical documents, where standard OCR metrics fail to capture temporal biases and contamination risks. It proposes a contamination-aware evaluation framework with novel metrics, Historical Character Preservation Rate (HCPR) and Archaic Insertion Rate (AIR), and applies it to 1,029 pages of 18th-century Russian Civil font texts across 12 multimodal LLMs, multiple processing modes, and prompt strategies. Results show that Gemini and Qwen models outperform traditional OCR yet exhibit over-historicization (inserting archaic characters from incorrect historical periods), and that post-OCR correction can degrade performance, highlighting limitations of edit-based approaches. The framework gives digital humanities practitioners concrete guidelines for model selection, ground-truth creation, and robust evaluation, and it emphasizes data-contamination considerations for ongoing benchmarking and reproducibility.

Abstract

Digital humanities scholars increasingly use Large Language Models for historical document digitization, yet lack appropriate evaluation frameworks for LLM-based OCR. Traditional metrics fail to capture temporal biases and period-specific errors crucial for historical corpus creation. We present an evaluation methodology for LLM-based historical OCR, addressing contamination risks and systematic biases in diplomatic transcription. Using 18th-century Russian Civil font texts, we introduce novel metrics including Historical Character Preservation Rate (HCPR) and Archaic Insertion Rate (AIR), alongside protocols for contamination control and stability testing. We evaluate 12 multimodal LLMs, finding that Gemini and Qwen models outperform traditional OCR while exhibiting over-historicization: inserting archaic characters from incorrect historical periods. Post-OCR correction degrades rather than improves performance. Our methodology provides digital humanities practitioners with guidelines for model selection and quality assessment in historical corpus digitization.
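As a rough illustration of the kind of measurement involved, the sketch below computes the standard character error rate (CER) as Levenshtein distance divided by reference length, plus a crude count of archaic characters kept or added by a transcription. The `ARCHAIC` set and the `archaic_counts` helper are hypothetical and only hint at what HCPR and AIR are concerned with; they are not the paper's definitions, which this summary does not state.

```python
from collections import Counter

# Assumption: an illustrative set of pre-reform Cyrillic letters
# (yat, decimal i, fita, izhitsa, lower and upper case).
# The paper's actual character inventory is not given in this summary.
ARCHAIC = set("\u0463\u0462\u0456\u0406\u0473\u0472\u0475\u0474")


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def archaic_counts(reference: str, hypothesis: str) -> tuple[int, int, int]:
    """Hypothetical helper: (archaic chars kept, archaic chars added,
    archaic chars in the reference), compared by counts, not alignment."""
    ref = Counter(c for c in reference if c in ARCHAIC)
    hyp = Counter(c for c in hypothesis if c in ARCHAIC)
    kept = sum((ref & hyp).values())
    added = sum((hyp - ref).values())
    return kept, added, sum(ref.values())
```

Comparing counts rather than aligned positions keeps the sketch short, but it cannot tell whether an archaic character was kept in the right place; an alignment-based variant would be needed for anything beyond a sanity check.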

Paper Structure

This paper contains 11 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: CER distribution by model (full-page mode)
  • Figure 2: Mean character error rate (CER) by model and prompt strategy (simple English, context-enhanced English, context-enhanced Russian). Lower values indicate better performance.
  • Figure 3: Model sensitivity to document features. Each cell shows the absolute correlation between a given feature (rows) and OCR error rates (CER/WER, averaged) for each model (columns; names shortened for readability). Higher values indicate greater sensitivity—that is, a model’s performance degrades more as that document feature increases. The most robust models (e.g., Gemini-2.5-Pro, o4-mini) exhibit consistently low sensitivity, while others (e.g., Claude3.5, Llama4-Mav) show heightened sensitivity to line count, old-character content, and layout complexity.
  • Figure 4: Excerpt from an 18th-century Russian book printed in Civil font. The letters “ш” (as in обетшаша, ослабѣша, Бофортши, отшествїи) display notable typographic variability, occasionally resembling the “т” glyph. Such variability, inherent to period printing, contributes to frequent “т→ш” substitution errors.
  • Figure 5: Subject distribution in the evaluation dataset. Left: Distribution by unique books (N=428). Right: Distribution by sampled page images (N=1029). The dataset is dominated by fiction, religion, history, and science, but maintains coverage across a variety of genres, supporting generalizable evaluation of historical OCR models.
  • ...and 1 more figures
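
The sensitivity measure described in the Figure 3 caption (absolute correlation between a document feature and averaged CER/WER) could be computed roughly as follows. This is a minimal sketch that assumes Pearson correlation and assumes CER and WER are averaged per page before correlating; the caption does not specify either choice, and the function names are illustrative.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def sensitivity(feature: list[float],
                cer: list[float],
                wer: list[float]) -> float:
    """Absolute correlation between one per-page feature (e.g. line count)
    and the per-page mean of CER and WER. Assumed reading of the caption."""
    avg_err = [(c + w) / 2 for c, w in zip(cer, wer)]
    return abs(pearson(feature, avg_err))
```

Taking the absolute value means a model whose error falls as a feature increases scores the same as one whose error rises, so the direction of the effect has to be checked separately.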