End-to-end information extraction in handwritten documents: Understanding Paris marriage records from 1880 to 1940

Thomas Constum, Lucas Preel, Théo Larcher, Pierrick Tranouez, Thierry Paquet, Sandra Brée

TL;DR

This work introduces M-POPP, a public, page-level dataset of Paris marriage records (1880–1940) annotated for full-page handwritten and printed text recognition and information extraction (IE), addressing the challenge of dense, multi-block archival documents. It adapts the DAN end-to-end architecture to perform joint handwriting recognition and IE without segmentation, leveraging curriculum learning, synthetic data, and a novel hierarchical named-entity encoding to capture complex relationships among entities. The authors demonstrate state-of-the-art page-level information extraction on Esposalles and establish strong baselines on M-POPP, while analyzing how different encoding schemes for nested entities affect performance. They also provide preprocessing strategies to manage large page images and discuss plans to integrate language-model-based NER in future work, highlighting the practical impact for large-scale historical document analytics and digital humanities research.

Abstract

The EXO-POPP project aims to establish a comprehensive database comprising 300,000 marriage records from Paris and its suburbs, spanning the years 1880 to 1940, which are preserved in over 130,000 scans of double pages. Each marriage record may encompass up to 118 distinct types of information that require extraction from plain text. In this paper, we introduce the M-POPP dataset, a subset of the M-POPP database with annotations for full-page text recognition and information extraction in both handwritten and printed documents, and which is now publicly available. We present a fully end-to-end architecture adapted from the DAN, designed to perform both handwritten text recognition and information extraction directly from page images without the need for explicit segmentation. We showcase the information extraction capabilities of this architecture by achieving a new state of the art for full-page information extraction on Esposalles, and we use this architecture as a baseline for the M-POPP dataset. We also assess and compare how different encoding strategies for named entities in the text affect the performance of jointly recognizing handwritten text and extracting information from full pages.

Paper Structure (25 sections, 6 figures, 8 tables)

Figures (6)

  • Figure 1: An example of a double page containing two handwritten marriage records including one with margin notes (Paris, first arrondissement, 1880). The red rectangles give an example of each type of text block (A, B, and C).
  • Figure 2: An example of a double page containing six printed marriage records including two with margin notes (Paris, first arrondissement, 1940).
  • Figure 3: (a) Sample from a marriage record from Paris, 1910. (b) View of the corresponding annotation for IE in Pivan.
  • Figure 4: Number of occurrences for each type of named entity sub-element in the printed dataset.
  • Figure 5: Number of occurrences for each type of named entity sub-element in the handwritten dataset.
  • ...and 1 more figure