Fetch-A-Set: A Large-Scale OCR-Free Benchmark for Historical Document Retrieval

Adrià Molina, Oriol Ramos Terrades, Josep Lladós

TL;DR

Fetch-A-Set (FAS) introduces a large-scale OCR-free benchmark for historical document retrieval, addressing text-to-image topic spotting and image-to-text information extraction across centuries of Spanish legislative records. The dataset contains roughly 400K fragment–query pairs with ground-truth associations generated via Mask-RCNN region proposals and entity matching, plus 1,024 distractor documents to enable efficient evaluation. Two baselines, a vision-based ViT-B/32 model and an OCR-based text encoder with sentence-BERT, reveal that vision-oriented methods are more robust under low legibility while text-based methods excel on legible text; results motivate hybrid systems that leverage both modalities. The work also analyzes temporal bias and visual cues, showing that temporal information can be embedded in visual representations and advocating for OCR-free and multimodal approaches to scale and improve historical document understanding in cultural heritage contexts.

Abstract

This paper introduces Fetch-A-Set (FAS), a comprehensive benchmark tailored for legislative historical document analysis systems, addressing the challenges of large-scale document retrieval in historical contexts. The benchmark comprises a vast repository of documents dating back to the 17th century, serving both as a training resource and an evaluation benchmark for retrieval systems. It fills a critical gap in the literature by focusing on complex extractive tasks within the domain of cultural heritage. The proposed benchmark tackles the multifaceted problem of historical document analysis, including text-to-image retrieval for queries and image-to-text topic extraction from document fragments, all while accommodating varying levels of document legibility. This benchmark aims to spur advancements in the field by providing baselines and data for the development and evaluation of robust historical document retrieval systems, particularly in scenarios characterized by a wide historical spectrum.
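
As a rough illustration of the text-to-image retrieval setting with distractors, the sketch below ranks a small gallery of fragment embeddings against query embeddings and reports recall@k. It is a minimal sketch under stated assumptions, not the authors' evaluation code: the metric choice (recall@k), the function names, and the toy two-dimensional vectors are all hypothetical, and a real system would use embeddings produced by the vision or text encoders.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(
    query_vecs: Dict[str, Sequence[float]],
    gallery_vecs: Dict[str, Sequence[float]],
    ground_truth: Dict[str, str],
    k: int,
) -> float:
    """Fraction of queries whose matching fragment is ranked in the top k.

    gallery_vecs holds both true fragments and distractors; distractors
    never appear as values in ground_truth.
    """
    hits = 0
    for qid, qvec in query_vecs.items():
        # Rank every gallery item by similarity to the query, best first.
        ranked = sorted(
            gallery_vecs,
            key=lambda fid: cosine(qvec, gallery_vecs[fid]),
            reverse=True,
        )
        if ground_truth[qid] in ranked[:k]:
            hits += 1
    return hits / len(query_vecs) if query_vecs else 0.0


# Toy, hand-made embeddings (hypothetical stand-ins for encoder outputs).
queries = {"q1": [1.0, 0.0], "q2": [0.0, 1.0]}
gallery = {
    "frag_a": [0.9, 0.1],        # true match for q1
    "frag_b": [0.6, 0.8],        # true match for q2
    "distractor_1": [0.1, 0.9],  # distractor that outranks frag_b for q2
}
gt = {"q1": "frag_a", "q2": "frag_b"}

print(recall_at_k(queries, gallery, gt, k=1))
print(recall_at_k(queries, gallery, gt, k=2))
```

In this toy gallery the distractor sits closer to `q2` than its true fragment does, so the score at k=1 is lower than at k=2. That is the effect distractor documents are meant to expose: a retrieval model has to separate true matches from plausible but unrelated material.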

Paper Structure (12 sections, 11 figures, 3 tables)

Figures (11)

  • Figure 1: Illustration depicting the primary objectives pursued by FAS in evaluating and training information extraction systems. Specifically, retrieving document fragments based on a given topic in natural language (right) or generating plausible descriptions from a given fragment (left).
  • Figure 2: Query-Region Matching system for creating the Ground Truth (GT) for the FAS dataset. Since the presence of the fragment $F$ in the document $d$ is human-annotated, and the number of invalid fragments within a page is typically low, the risk of adding noise is significantly reduced.
  • Figure 3: A scatterplot depicting FAS as the preeminent historical document-based retrieval benchmark among those previously examined, while also maintaining a substantial temporal scope. This characteristic enhances the robustness of historical document analysis systems by encompassing wider temporal breadths.
  • Figure 4: Beeswarm plot showing the time arrow of the dataset, with years (X-Axis), historical period (Y-Axis), quantity of documents (size), and average legibility score (hue).
  • Figure 5: Left: Distribution of legibility (Y-Axis) through the years (X-Axis) per historical period (hue). Right: Global distribution of legibility score.
  • ...and 6 more figures