Evaluating LLMs for Historical Document OCR: A Methodological Framework for Digital Humanities
Maria Levchenko
TL;DR
This work addresses the challenge of evaluating Large Language Model (LLM)-based OCR on historical documents, where standard OCR metrics fail to capture temporal biases and contamination risks. It proposes a contamination-aware evaluation framework with two novel metrics, Historical Character Preservation Rate (HCPR) and Archaic Insertion Rate (AIR), and applies it to 1,029 pages of 18th-century Russian Civil font texts across 12 multimodal LLMs, multiple processing modes, and prompt strategies. Results show that Gemini and Qwen models outperform traditional OCR yet exhibit over-historicization, inserting archaic characters from incorrect historical periods. Post-OCR correction can degrade performance, which highlights the limitations of edit-based approaches. The framework gives digital humanities practitioners concrete guidelines for model selection, ground-truth creation, and robust evaluation, and it emphasizes data-contamination considerations for ongoing benchmarking and reproducibility.
Abstract
Digital humanities scholars increasingly use Large Language Models for historical document digitization, yet they lack appropriate evaluation frameworks for LLM-based OCR. Traditional metrics fail to capture the temporal biases and period-specific errors that are crucial for historical corpus creation. We present an evaluation methodology for LLM-based historical OCR that addresses contamination risks and systematic biases in diplomatic transcription. Using 18th-century Russian Civil font texts, we introduce novel metrics, including Historical Character Preservation Rate (HCPR) and Archaic Insertion Rate (AIR), alongside protocols for contamination control and stability testing. We evaluate 12 multimodal LLMs and find that Gemini and Qwen models outperform traditional OCR while exhibiting over-historicization, that is, inserting archaic characters from incorrect historical periods. Post-OCR correction degrades rather than improves performance. Our methodology provides digital humanities practitioners with guidelines for model selection and quality assessment in historical corpus digitization.
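This section names HCPR and AIR but does not give their formulas, so the following is only an illustrative sketch of how metrics of this kind might be computed. Everything in it is an assumption made for the example: the character sets `HISTORICAL` and `ARCHAIC`, the function names `hcpr` and `air`, the use of Python's `difflib` for character alignment, and the choice to normalize AIR by output length. It should not be read as the paper's definition.

```python
from difflib import SequenceMatcher

# Assumed, illustrative character sets (not taken from the paper).
# HISTORICAL: period-appropriate letters that a diplomatic transcription should keep.
HISTORICAL = set("ѣіѳѵ")
# ARCHAIC: any old-orthography letter, including ones assumed to be out of period.
ARCHAIC = HISTORICAL | set("ѡѧѱ")


def _opcodes(truth: str, hyp: str):
    """Character-level alignment between ground truth and OCR output."""
    return SequenceMatcher(None, truth, hyp, autojunk=False).get_opcodes()


def hcpr(truth: str, hyp: str) -> float:
    """Assumed HCPR: share of historical characters in the ground truth
    that fall inside aligned 'equal' spans, i.e. survive unchanged."""
    total = sum(ch in HISTORICAL for ch in truth)
    if total == 0:
        return 1.0  # nothing to preserve
    preserved = 0
    for tag, i1, i2, _j1, _j2 in _opcodes(truth, hyp):
        if tag == "equal":
            preserved += sum(ch in HISTORICAL for ch in truth[i1:i2])
    return preserved / total


def air(truth: str, hyp: str) -> float:
    """Assumed AIR: archaic characters that appear in inserted or
    substituted spans of the output, divided by the output length."""
    if not hyp:
        return 0.0
    spurious = 0
    for tag, _i1, _i2, j1, j2 in _opcodes(truth, hyp):
        if tag in ("insert", "replace"):
            spurious += sum(ch in ARCHAIC for ch in hyp[j1:j2])
    return spurious / len(hyp)
```

Under these assumptions, a model that modernizes "въ лѣсу" to "в лесу" loses the only historical character and scores an HCPR of 0, while a model that turns a ground-truth "лесу" into "лѣсу" has added one archaic character to four output characters, giving an AIR of 0.25. Reporting the two numbers separately is what lets over-historicization show up as its own failure mode instead of being folded into a single edit-distance score.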
