TextBite: A Historical Czech Document Dataset for Logical Page Segmentation

Martin Kostelník, Karel Beneš, Michal Hradiš

TL;DR

The paper addresses OCR-free logical page segmentation by reframing it as clustering of foreground text pixels. It introduces the TextBite dataset of 8,449 historical Czech pages with 78,863 annotated segments and reading-order relations, stored in an extended COCO format. Evaluation uses a per-pixel Rand index computed over text pixels only, so segmentation approaches can be compared without depending on OCR output. Three baselines are presented: YOLO-based detection, a graph neural network for region merging, and a transformer-based relation predictor. Graph-based merging achieves the strongest performance (Rand index 92.5%), while the transformer-based predictor benefits from visual context. The dataset, baselines, and evaluation framework are publicly available.

Abstract

Logical page segmentation is an important step in document analysis, enabling better semantic representations, information retrieval, and text understanding. Previous approaches define logical segmentation either through text or geometric objects, relying on OCR or precise geometry. To avoid the need for OCR, we define the task purely as segmentation in the image domain. Furthermore, to ensure the evaluation remains unaffected by geometrical variations that do not impact text segmentation, we propose to use only foreground text pixels in the evaluation metric and disregard all background pixels. To support research in logical document segmentation, we introduce TextBite, a dataset of historical Czech documents spanning the 18th to 20th centuries, featuring diverse layouts from newspapers, dictionaries, and handwritten records. The dataset comprises 8,449 page images with 78,863 annotated segments of logically and thematically coherent text. We propose a set of baseline methods combining text region detection and relation prediction. The dataset, baselines and evaluation framework can be accessed at https://github.com/DCGM/textbite-dataset.
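The foreground-only evaluation described above can be illustrated with a small sketch of a per-pixel Rand index. This is not the authors' evaluation code: the function name `foreground_rand_index`, its arguments, and the assumption that every foreground pixel carries an integer segment label (with "unassigned" treated as just another label) are all hypothetical choices made for illustration.

```python
import numpy as np


def comb2(counts):
    """Number of unordered pairs that can be drawn from each count."""
    counts = np.asarray(counts, dtype=np.int64)
    return counts * (counts - 1) // 2


def foreground_rand_index(gt, pred, foreground):
    """Rand index between two label maps, restricted to foreground pixels.

    gt, pred:    integer label maps of identical shape (hypothetical inputs)
    foreground:  boolean mask of text pixels; background is ignored entirely
    """
    gt_fg = gt[foreground].ravel()
    pred_fg = pred[foreground].ravel()
    n = gt_fg.size
    if n < 2:
        return 1.0  # no pixel pairs to disagree on

    # Map arbitrary label values to dense ids 0..k-1.
    _, gt_ids = np.unique(gt_fg, return_inverse=True)
    _, pred_ids = np.unique(pred_fg, return_inverse=True)

    # Contingency table, flattened: count of pixels per (gt, pred) label pair.
    n_pred = int(pred_ids.max()) + 1
    joint = np.bincount(gt_ids.astype(np.int64) * n_pred + pred_ids)

    same_both = int(comb2(joint).sum())                 # together in both
    same_gt = int(comb2(np.bincount(gt_ids)).sum())     # together in gt
    same_pred = int(comb2(np.bincount(pred_ids)).sum()) # together in pred
    total = n * (n - 1) // 2

    # Agreeing pairs = together in both + separated in both.
    agreeing = total + 2 * same_both - same_gt - same_pred
    return agreeing / total


if __name__ == "__main__":
    gt = np.array([[1, 1, 2, 2]])
    pred = np.array([[1, 1, 1, 2]])
    fg = np.ones_like(gt, dtype=bool)
    print(foreground_rand_index(gt, pred, fg))
```

In this toy example, three of the six pixel pairs are treated consistently by both segmentations, so the score is 0.5. Because only pixels under the mask enter the computation, two predictions that differ only in how they draw region boundaries through the background receive the same score.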

Paper Structure

This paper contains 17 sections, 1 equation, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Example of logical segmentation of a newspaper page as proposed in TextBite. Each segment is denoted in a different color. Only the colored pixels are considered in the evaluation. Page metadata, such as the title or edition, is not included in the segmentation, as it does not form a thematically coherent segment. However, our evaluation scheme does not penalize marking these parts of the page as additional segments.
  • Figure 2: Examples of annotated pages in the TextBite dataset with various layouts.
  • Figure 3: Cropped section of an annotated image showcasing an enclosed title region (green) in a text region (blue).
  • Figure 4: Data characteristics of the TextBite dataset.
  • Figure 5: Construction of the ground truth pixel segmentation of a page. First, we turn the human annotation in the form of connected regions (a) into their masked representation (b). This is then intersected with textlines obtained from OCR and thresholded, which results in the final pixel segmentation (c).
  • ...and 1 more figure
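
The ground-truth construction summarized in the Figure 5 caption can be sketched roughly as follows. The caption does not specify how the thresholding is done, so this sketch assumes a fixed global threshold on a grayscale image with dark ink on a light background; the function `pixel_segmentation`, its parameters, and the use of label 0 for "not evaluated" are hypothetical and not taken from the paper's code.

```python
import numpy as np


def pixel_segmentation(region_labels, textline_mask, gray, threshold=128):
    """Combine region annotations, textline areas, and ink pixels.

    region_labels: int array (H, W), segment id per pixel, 0 = no segment
    textline_mask: bool array (H, W), True inside detected textlines
    gray:          uint8 array (H, W), grayscale page image
    threshold:     assumed global ink threshold (an illustration only)
    """
    ink = gray < threshold            # dark pixels are treated as text ink
    foreground = textline_mask & ink  # keep ink that lies inside textlines
    # Foreground pixels inherit the id of the annotated region they fall in.
    return np.where(foreground, region_labels, 0)


if __name__ == "__main__":
    regions = np.array([[1, 1, 2, 2],
                        [1, 1, 2, 2]])
    lines = np.array([[True, True, True, True],
                      [False, False, False, False]])
    page = np.array([[10, 250, 20, 240],
                     [10, 10, 10, 10]], dtype=np.uint8)
    print(pixel_segmentation(regions, lines, page))
```

Only pixels that are inside an annotated region, inside a textline, and dark enough to count as ink keep a nonzero label; everything else is excluded from evaluation.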