Diachronic Document Dataset for Semantic Layout Analysis

Thibault Clérice, Juliette Janes, Hugo Scheithauer, Sarah Bénière, Florian Cafiero, Laurent Romary, Simon Gabay, Benoît Sagot

TL;DR

Results show that a 1280-pixel input size for YOLO is optimal and that training on subsets generally benefits from incorporating them into a generic model rather than fine-tuning pre-trained weights.

Abstract

We present a novel, open-access dataset designed for semantic layout analysis, built to support document recreation workflows through mapping with the Text Encoding Initiative (TEI) standard. This dataset includes 7,254 annotated pages spanning a large temporal range (1600-2024) of digitised and born-digital materials across diverse document types (magazines, papers from sciences and humanities, PhD theses, monographs, plays, administrative reports, etc.) sorted into modular subsets. By incorporating content from different periods and genres, it addresses varying layout complexities and historical changes in document structure. The modular design allows domain-specific configurations. We evaluate object detection models on this dataset, examining the impact of input size and subset-based training. Results show that a 1280-pixel input size for YOLO is optimal and that training on subsets generally benefits from incorporating them into a generic model rather than fine-tuning pre-trained weights.

Paper Structure

This paper contains 18 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Examples of different layouts from subsets of the dataset.
  • Figure 2: Distribution over time of documents based on our subsets.
  • Figure 3: Distribution of class instances over time.
  • Figure 4: Curve of the mAP50 based on the input size and the model size.
  • Figure 5: mAP50 across the best YOLOv11 configurations, starting from the medium model, on each 100-page subset.