TextBite: A Historical Czech Document Dataset for Logical Page Segmentation
Martin Kostelník, Karel Beneš, Michal Hradiš
TL;DR
The paper addresses OCR-free logical page segmentation by reframing it as clustering of foreground text pixels. It introduces the TextBite dataset of 8,449 historical Czech pages with 78,863 annotated segments and reading-order relations, stored in an extended COCO format. It also provides an evaluation framework based on a per-pixel Rand index, which allows segmentation approaches to be compared fairly without depending on OCR. Three baselines are evaluated: YOLO-based detection, a graph neural network for region merging, and a transformer-based relation predictor. Graph-based merging achieves the strongest performance (Rand index 92.5%), and the transformer-based predictor benefits from visual context. TextBite thus offers both a dataset and a methodology for OCR-independent document segmentation and layout analysis, and it is publicly available to support further work.
Abstract
Logical page segmentation is an important step in document analysis, enabling better semantic representations, information retrieval, and text understanding. Previous approaches define logical segmentation either through text or geometric objects, relying on OCR or precise geometry. To avoid the need for OCR, we define the task purely as segmentation in the image domain. Furthermore, to ensure the evaluation remains unaffected by geometrical variations that do not impact text segmentation, we propose to use only foreground text pixels in the evaluation metric and disregard all background pixels. To support research in logical document segmentation, we introduce TextBite, a dataset of historical Czech documents spanning the 18th to 20th centuries, featuring diverse layouts from newspapers, dictionaries, and handwritten records. The dataset comprises 8,449 page images with 78,863 annotated segments of logically and thematically coherent text. We propose a set of baseline methods combining text region detection and relation prediction. The dataset, baselines and evaluation framework can be accessed at https://github.com/DCGM/textbite-dataset.
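The foreground-only evaluation described above can be illustrated with a short sketch. This is not the authors' implementation; it assumes that the ground truth and the prediction are each given as an integer label map with one segment id per pixel, plus a boolean mask marking text pixels. The function name `foreground_rand_index` and the helper `pairs` are hypothetical. The Rand index is the fraction of pixel pairs on which the two segmentations agree (both place the pair in the same segment, or both place it in different segments), computed here from contingency counts rather than by enumerating pairs.

```python
import numpy as np


def pairs(counts):
    """Number of unordered pairs that can be drawn from each count."""
    counts = np.asarray(counts, dtype=np.int64)
    return counts * (counts - 1) // 2


def foreground_rand_index(gt_labels, pred_labels, foreground_mask):
    """Rand index restricted to foreground pixels (illustrative sketch).

    gt_labels, pred_labels: integer arrays of equal shape, one segment id
    per pixel (assumed input format, not taken from the paper).
    foreground_mask: boolean array of the same shape, True on text pixels.
    """
    mask = np.asarray(foreground_mask, dtype=bool)
    gt = np.asarray(gt_labels)[mask]      # background pixels are dropped here
    pred = np.asarray(pred_labels)[mask]
    n = gt.size
    if n < 2:
        return 1.0  # no pixel pairs to disagree on

    # Re-index segment ids to 0..k-1 so they can be combined and counted.
    _, gt_ids = np.unique(gt, return_inverse=True)
    _, pred_ids = np.unique(pred, return_inverse=True)
    n_pred = int(pred_ids.max()) + 1

    # Sizes of the non-empty cells of the contingency table.
    joint = gt_ids.astype(np.int64) * n_pred + pred_ids
    _, joint_counts = np.unique(joint, return_counts=True)

    total = int(pairs(n))
    same_in_both = int(pairs(joint_counts).sum())
    same_in_gt = int(pairs(np.bincount(gt_ids)).sum())
    same_in_pred = int(pairs(np.bincount(pred_ids)).sum())

    # Agreements = pairs together in both + pairs apart in both.
    agreements = total + 2 * same_in_both - same_in_gt - same_in_pred
    return agreements / total
```

Because only masked pixels enter the computation, changing the predicted labels of background pixels, or the exact outline of a region where no text is present, leaves the score unchanged.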
