You Actually Look Twice At it (YALTAi): using an object detection approach instead of region segmentation within the Kraken engine

Thibault Clérice

TL;DR

The paper addresses robust layout analysis for historical documents, where traditional Kraken-based segmentation often fails on small datasets. It advocates reframing segmentation as an object-detection problem using isothetic bounding boxes via YOLOv5, and demonstrates substantial accuracy and efficiency gains over Kraken on two historical datasets. The authors introduce two new datasets (YALTAi-MSS-EPB and YALTAi-Tables), provide an open-source integration tool (YALTAi) that plugs YOLOv5 into Kraken, and show that the approach markedly improves main-body zone detection and column separation, with practical implications for scalable text extraction from historical corpora. The main limitation is the isothetic-box constraint, which oriented bounding boxes could potentially relax; the work offers a clear path toward more reliable, efficient layout analysis in humanities research pipelines.

Abstract

Layout analysis (the identification of zones and their classification) is, along with line segmentation, the first step in Optical Character Recognition and similar tasks. The ability to distinguish the main body of text from marginal text or running titles makes the difference between extracting the full text of the work from a digitized book and producing noisy output. We show that most segmenters focus on pixel classification and that polygonization of this output has not been used as a target in the latest competitions on historical documents (ICDAR 2017 and onwards), despite being the focus in the early 2010s. We propose to shift, for efficiency, the task from pixel classification-based polygonization to object detection using isothetic rectangles. We compare the output of Kraken and YOLOv5 in terms of segmentation and show that the latter severely outperforms the former on small datasets (1110 samples and below). We release two datasets for training and evaluation on historical documents as well as a new package, YALTAi, which injects YOLOv5 into the segmentation pipeline of Kraken 4.1.
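
To make the shift from polygons to isothetic (axis-aligned) rectangles concrete, the following minimal sketch reduces a zone polygon to its enclosing rectangle and writes it as a normalized `class x_center y_center width height` line, the label layout YOLOv5 reads for training. It is an illustration under stated assumptions, not the conversion code shipped with YALTAi: the function names, the sample polygon, the image size, and the class id are all hypothetical.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax) in pixels


def polygon_to_isothetic_box(polygon: Iterable[Point]) -> Box:
    """Return the smallest axis-aligned rectangle enclosing the polygon."""
    points = list(polygon)
    if not points:
        raise ValueError("polygon must contain at least one point")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def box_to_yolo_label(class_id: int, box: Box,
                      image_width: int, image_height: int) -> str:
    """Format a pixel box as a normalized YOLO label line."""
    xmin, ymin, xmax, ymax = box
    x_center = (xmin + xmax) / 2 / image_width
    y_center = (ymin + ymax) / 2 / image_height
    width = (xmax - xmin) / image_width
    height = (ymax - ymin) / image_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical zone polygon on a 1000 x 1000 pixel page; class 0 is an
# assumed id for the main text zone.
zone_polygon = [(100, 200), (400, 180), (420, 600), (90, 610)]
zone_box = polygon_to_isothetic_box(zone_polygon)
zone_label = box_to_yolo_label(0, zone_box, 1000, 1000)
print(zone_box)    # (90, 180, 420, 610)
print(zone_label)  # 0 0.255000 0.395000 0.330000 0.430000
```

The rectangle discards the exact outline of the zone, which is the trade-off the abstract accepts for efficiency: a skewed or irregular zone yields a box that also covers some surrounding area.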

Paper Structure (16 sections, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Prediction on the test set with YOLOv5x models for the Segmonto dataset (three pictures on the left) and the tabular dataset (last picture, columns are in alternating colours for readability). Illustrations are in orange (first picture top), drop capitals are in darker orange, marginal text in green, yellow is the main body of text.
  • Figure 2: Typical Handwritten Text Recognition (HTR) workflow. A user uploads a set of pictures from a digitized book, segments the document at both the layout and line levels, corrects the segmentation, provides a transcription or corrects an automatic one, and then fine-tunes or creates models for other pages of the same document or for documents of the same kind.
  • Figure 3: On the left, an example of a polygon in the ground truth for a case where the Kraken prediction is correct. On the right, its simplification into an isothetic rectangle for object detection.
  • Figure 4: Workflow and responsibilities at inference time.
  • Figure 5: Tabular dataset excerpts. Two images on the left are from the Lectaurep dataset, three images on the right are original data for the test set.
  • ...and 1 more figure