Judge a Book by its Cover: Investigating Multi-Modal LLMs for Multi-Page Handwritten Document Transcription

Benjamin Gutteridge, Matthew Thomas Jackson, Toni Kukurin, Xiaowen Dong

TL;DR

This work tackles the challenge of transcribing multi-page handwritten documents in a zero-shot setting by leveraging multi-modal large language models (MLLMs) in novel configurations. The authors introduce the +first page method, which supplies the MLLM with the OCR output of the full document plus only the first-page image, enabling cross-page extrapolation of formatting and OCR-error patterns to unseen pages while keeping cost down. Through experiments on a multi-page version of the IAM Handwriting Database, they show that +first page improves transcription accuracy and sits on a cost-performance Pareto frontier relative to traditional OCR and fully vision-based approaches. The study also analyzes how MLLM choice, prompt design, and document length affect performance, and identifies scaling challenges and opportunities for prompt caching to make cross-page transcription more practical. Overall, the work demonstrates promising directions for cost-effective, zero-shot transcription of long handwritten documents using MLLMs and partial vision inputs.

Abstract

Handwritten text recognition (HTR) remains a challenging task, particularly for multi-page documents where pages share common formatting and contextual features. While modern optical character recognition (OCR) engines are proficient with printed text, their performance on handwriting is limited, often requiring costly labeled data for fine-tuning. In this paper, we explore the use of multi-modal large language models (MLLMs) for transcribing multi-page handwritten documents in a zero-shot setting. We investigate various configurations of commercial OCR engines and MLLMs, utilizing the latter both as end-to-end transcribers and as post-processors, with and without image components. We propose a novel method, '+first page', which enhances MLLM transcription by providing the OCR output of the entire document along with just the first page image. This approach leverages shared document features without incurring the high cost of processing all images. Experiments on a multi-page version of the IAM Handwriting Database demonstrate that '+first page' improves transcription accuracy, balances cost with performance, and even enhances results on out-of-sample text by extrapolating formatting and OCR error patterns from a single page.
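As a rough illustration of the input that '+first page' describes (OCR text for every page, image for the first page only), the sketch below assembles such a request. It is a minimal sketch under stated assumptions, not the authors' implementation: the `Page` class, the `build_first_page_request` function, the request dictionary layout, and the instruction wording are all hypothetical, and no particular MLLM API is assumed.

```python
import base64
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Page:
    """Hypothetical container for one page of a scanned document."""
    ocr_text: str       # OCR engine output for this page
    image_bytes: bytes  # raw scan; only the first page's image is sent


def build_first_page_request(pages: List[Page]) -> Dict[str, object]:
    """Combine the OCR text of all pages with the image of page 1 only."""
    if not pages:
        raise ValueError("document has no pages")

    # Concatenate the OCR output of every page, labelled by page number.
    ocr_block = "\n\n".join(
        f"[Page {i}]\n{page.ocr_text}" for i, page in enumerate(pages, start=1)
    )

    # Hypothetical instruction text; the paper's actual prompt is not shown here.
    instruction = (
        "Below is the OCR output of a multi-page handwritten document, "
        "together with the image of its first page only. Use the image to "
        "identify formatting and recurring OCR errors, then correct the "
        "transcription of every page."
    )

    return {
        "text": instruction + "\n\n" + ocr_block,
        # Exactly one image: the first page, base64-encoded for transport.
        "images": [base64.b64encode(pages[0].image_bytes).decode("ascii")],
    }
```

The resulting dictionary would still need to be mapped onto whatever message format a given MLLM provider expects. The point of the sketch is only that the image payload stays constant at one page while the text payload grows with document length.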

Paper Structure

This paper contains 34 sections, 14 figures, 6 tables.

Figures (14)

  • Figure 1: An illustration of how +first page works; the OCR text of a multi-page document is provided to an MLLM, along with just the first page image of the document. Blue denotes the first page.
  • Figure 2: An example of how +first page propagates OCR error corrections across pages. Although the MLLM only has access to the image of the first page, it learns from the first-page corrections that the OCR (i) frequently mistakes 'i' for '1' and (ii) frequently mistakes words for numbers, and uses this to correctly transcribe the word 'in' on the unseen second page. See Figures \ref{fig:example_draws}--\ref{fig:example_columns} in the Appendix for further examples.
  • Figure 3: ocr only ($\rightarrow$ LLM) illustration.
  • Figure 4: ocr only pbp illustration.
  • Figure 5: vision* illustration.
  • ...and 9 more figures