The Socface Project: Large-Scale Collection, Processing, and Analysis of a Century of French Censuses

Mélodie Boillet, Solène Tarride, Manon Blanco, Valentin Rigal, Yoann Schneider, Bastien Abadie, Lionel Kesztenbaum, Christopher Kermorvant

TL;DR

The paper tackles large-scale extraction of nominative census data from handwritten French census lists spanning a century (1836–1936) across 94 archives. It introduces a unified full-page transformer-based model (DAN) for end-to-end table recognition, coupled with a data collection and normalization pipeline (Socface-Spider) and HPC-enabled processing via Arkindex with SLURM under PySlurm. Ground truth is generated through Callico annotations, and the workflow achieves high list-page classification accuracy and substantial household extraction performance, enabling the construction of a unified, searchable database intended to cover hundreds of millions of records. The work demonstrates practical impact by enabling public access to historical demographic data and providing a scalable platform for archival data extraction and analysis, with plans to extend context continuity across pages and to incorporate address-level information.

Abstract

This paper presents a complete processing workflow for extracting information from French census lists from 1836 to 1936. These lists contain information about individuals living in France and their households. We aim to extract all the information contained in these tables using automatic handwritten table recognition. At the end of the Socface project, in which our work is taking place, the extracted information will be redistributed to the departmental archives, and the nominative lists will be freely available to the public, allowing anyone to browse hundreds of millions of records. The extracted data will be used by demographers to analyze social change over time, significantly improving our understanding of French economic and social structures. For this project, we developed a complete processing workflow: large-scale data collection from French departmental archives, collaborative annotation of documents, training of handwritten table text and structure recognition models, and mass processing of millions of images. We present the tools we have developed to easily collect and process millions of pages. We also show that it is possible to process such a wide variety of tables with a single table recognition model that uses the image of the entire page to recognize information about individuals, categorize them, and automatically group them into households. The entire process has been successfully used to process the documents of a departmental archive, representing more than 450,000 images.

Paper Structure (23 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: First page of nominal lists for the commune of Moulins (department of Allier) for three census years. The quality of the pages varies greatly from one year to the next. In addition, the table template evolved over the years: in 1881, civil status was replaced by a column marking the position in the household, and in 1906, age was replaced by year of birth, as can be seen in the 1936 example.
  • Figure 2: Configuration interface for retrieving and organizing data from the input CSV file. The "Name" column indicates the fields present in the CSV file. The "Type" column indicates how the CSV fields will be used (whether it corresponds to the year or commune, or if the field should be ignored). If the data displayed in the "Values sample" column is correct, the user will see a preview of the retrieved images with their metadata.
  • Figure 3: Example of digitized pages from the census of the commune of Moulins (department of Allier) in 1881.
  • Figure 4: Callico interfaces for annotating information on individuals and grouping them into households.
  • Figure 5: Table header and first rows of a table from the census of the commune of Neuilly-le-Réal (department of Allier) in 1901. The label used to train the model for this part of the table is: <s-h>Gendre <f>Pierre <o>cultivateur <l>chef <e>patron <a>75 <n>française <s>Paraud <f>Marie <o>néant <l>épouse <e>néant <a>66 <n>idem <s-h>Martin <f>Pierre <o>métayer <l>chef <e>patron <a>69 <n>idem <s>Joyoz <f>Suzanne <o>néant <l>mère <e>néant <a>72 <n>idem ... Note that the order of the entities in the labels is always the same and does not always correspond to the order in which the information appears in the images, as there are multiple templates.
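
The label format quoted in the Figure 5 caption can be read back into structured records. The following is a minimal sketch, not the authors' code: the function name `parse_label`, the field names, and the meaning assigned to each tag (for example `<s-h>` as the surname of a person who opens a new household, `<l>` as the position in the household) are assumptions inferred from the single example in the caption.

```python
import re
from typing import Dict, List

# Assumed meaning of each tag, inferred from the Figure 5 example only.
TAG_FIELDS = {
    "s-h": "surname",      # surname of the person who opens a new household
    "s": "surname",        # surname of any other person
    "f": "first_name",
    "o": "occupation",
    "l": "link",           # position in the household (chef, épouse, ...)
    "e": "employer",
    "a": "age",
    "n": "nationality",
}

# A token is a tag such as <s-h> followed by free text up to the next tag.
TOKEN = re.compile(r"<([a-z-]+)>([^<]*)")


def parse_label(label: str) -> List[List[Dict[str, str]]]:
    """Return a list of households, each a list of person records."""
    households: List[List[Dict[str, str]]] = []
    for tag, value in TOKEN.findall(label):
        if tag not in TAG_FIELDS:
            raise ValueError(f"unknown tag: <{tag}>")
        if tag == "s-h":
            # A household-head surname starts a new household.
            households.append([{}])
        elif tag == "s":
            # A plain surname starts a new person in the current household.
            if not households:
                households.append([])
            households[-1].append({})
        elif not households:
            raise ValueError(f"<{tag}> appears before any surname")
        households[-1][-1][TAG_FIELDS[tag]] = value.strip()
    return households


# Label text copied from the Figure 5 caption (the trailing "..." is omitted).
LABEL = (
    "<s-h>Gendre <f>Pierre <o>cultivateur <l>chef <e>patron <a>75 <n>française "
    "<s>Paraud <f>Marie <o>néant <l>épouse <e>néant <a>66 <n>idem "
    "<s-h>Martin <f>Pierre <o>métayer <l>chef <e>patron <a>69 <n>idem "
    "<s>Joyoz <f>Suzanne <o>néant <l>mère <e>néant <a>72 <n>idem"
)

households = parse_label(LABEL)
for number, members in enumerate(households, start=1):
    names = ", ".join(f"{p['first_name']} {p['surname']}" for p in members)
    print(f"Household {number}: {names}")
```

Values such as "idem" and "néant" are kept verbatim here; resolving "idem" against the previous row would be a separate post-processing step that this sketch does not attempt.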