Few-Shot Connectivity-Aware Text Line Segmentation in Historical Documents

Rafael Sterzinger, Tingyu Lin, Robert Sablatnig

TL;DR

This work tackles text line segmentation in historical documents under severe data scarcity, where large, pixel-precise annotations are impractical. It argues for a simple, data-efficient pipeline: a lightweight UNet++ model trained on small image patches and augmented with a connectivity-preserving loss that penalizes line fragmentation and unwanted merges. Under a strict three-page training regime, the approach sets a new state of the art on U-DIADS-TL, with a 200% gain in Recognition Accuracy and a 75% gain in Line IoU, and, when adapted to DIVA-HisDB baseline detection, matches or exceeds the F-Measure of the competition winner. Ablation studies on network depth and on the hyper-parameters of the connectivity loss support the design, indicating that simpler architectures can outperform more complex models in few-shot historical document analysis.

Abstract

A foundational task for the digital analysis of documents is text line segmentation. However, automating this process with deep learning models is challenging because it requires large, annotated datasets that are often unavailable for historical documents. Additionally, the annotation process is a labor- and cost-intensive task that requires expert knowledge, which makes few-shot learning a promising direction for reducing data requirements. In this work, we demonstrate that small and simple architectures, coupled with a topology-aware loss function, are more accurate and data-efficient than more complex alternatives. We pair a lightweight UNet++ with a connectivity-aware loss, initially developed for neuron morphology, which explicitly penalizes structural errors like line fragmentation and unintended line merges. To increase our limited data, we train on small patches extracted from a mere three annotated pages per manuscript. Our methodology significantly improves upon the current state-of-the-art on the U-DIADS-TL dataset, with a 200% increase in Recognition Accuracy and a 75% increase in Line Intersection over Union. Our method also achieves an F-Measure score on par with or even exceeding that of the competition winner of the DIVA-HisDB baseline detection task, all while requiring only three annotated pages, exemplifying the efficacy of our approach. Our implementation is publicly available at: https://github.com/RafaelSterzinger/acpr_few_shot_hist.
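The patch-based training mentioned in the abstract can be illustrated with a minimal sketch. It assumes a page and its line mask are given as equally sized 2D grids (plain Python lists here, to stay self-contained); the function name `sample_patches`, the uniform random cropping, and the patch size are illustrative assumptions, not the authors' implementation.

```python
import random


def sample_patches(image, mask, patch_size, n_patches, seed=0):
    """Randomly crop aligned (image, mask) square patches from one page.

    Hypothetical helper: uniform random cropping is an assumption made
    for illustration, not necessarily the sampling used in the paper.
    """
    h, w = len(image), len(image[0])
    if patch_size > h or patch_size > w:
        raise ValueError("patch_size must not exceed the page dimensions")
    rng = random.Random(seed)
    patches = []
    for _ in range(n_patches):
        # Top-left corner chosen so the patch stays inside the page.
        top = rng.randint(0, h - patch_size)
        left = rng.randint(0, w - patch_size)
        rows = slice(top, top + patch_size)
        cols = slice(left, left + patch_size)
        patches.append((
            [row[cols] for row in image[rows]],
            [row[cols] for row in mask[rows]],
        ))
    return patches


# Toy 8x10 "page": each pixel value encodes its position; even rows are "text".
page = [[r * 10 + c for c in range(10)] for r in range(8)]
mask = [[1 if r % 2 == 0 else 0 for _ in range(10)] for r in range(8)]
patches = sample_patches(page, mask, patch_size=4, n_patches=5)
```

Because the same crop window is applied to the image and to its mask, every patch remains a valid training pair, which is how a handful of annotated pages can yield many training samples.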

Paper Structure

This paper contains 21 sections, 5 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Our few-shot method improves SOTA on U-DIADS-TL (Top) by 200% in RA and 75% in Line IoU, and matches or surpasses DIVA-HisDB (Bottom) top scores using just three annotated pages.
  • Figure 2: An illustration of the parameters $\alpha$ and $\beta$ and their effect on the loss, using a sample patch of Syriaque 341: $\alpha$ controls the trade-off between pixel- and structure-level errors, and $\beta$ the trade-off between split and merge penalties [grim2025efficient].
  • Figure 3: Example excerpts of the CB55 manuscript in DIVA-HisDB illustrating the issue of overlapping text line segments.
  • Figure 4: Ablation on Network Depth.
  • Figure 5: Ablating the Hyper-Parameters of the Connectivity Loss.
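To make the split and merge errors and the roles of $\alpha$ and $\beta$ from the Figure 2 caption concrete, here is a toy, non-differentiable sketch. It counts splits and merges by comparing connected components of a ground-truth and a predicted binary mask, then combines them with a pixel error. The helper names and the weighting formula in `toy_loss` are assumptions chosen for illustration; they are not the loss formulation of the paper or of grim2025efficient.

```python
from collections import deque


def label_components(mask):
    """4-connected component labelling of a binary grid (lists of 0/1)."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    current = 0
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not labels[r][c]:
                current += 1
                labels[r][c] = current
                queue = deque([(r, c)])
                while queue:  # breadth-first flood fill
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = current
                            queue.append((ny, nx))
    return labels, current


def count_splits_and_merges(gt, pred):
    """Split: one true line covered by several predicted components.
    Merge: one predicted component covering several true lines."""
    gt_lab, _ = label_components(gt)
    pr_lab, _ = label_components(pred)
    gt_to_pred, pred_to_gt = {}, {}
    for g_row, p_row in zip(gt_lab, pr_lab):
        for g, p in zip(g_row, p_row):
            if g and p:
                gt_to_pred.setdefault(g, set()).add(p)
                pred_to_gt.setdefault(p, set()).add(g)
    splits = sum(len(s) - 1 for s in gt_to_pred.values())
    merges = sum(len(s) - 1 for s in pred_to_gt.values())
    return splits, merges


def toy_loss(pixel_error, splits, merges, alpha=0.5, beta=0.5):
    """Assumed weighting: alpha trades pixel vs. structure terms,
    beta trades split vs. merge penalties."""
    return (1 - alpha) * pixel_error + alpha * (beta * splits + (1 - beta) * merges)


# Two ground-truth text lines separated by one background row.
gt = [[1, 1, 1, 1, 1],
      [0, 0, 0, 0, 0],
      [1, 1, 1, 1, 1]]
pred_split = [[1, 1, 0, 1, 1],   # first line broken in two
              [0, 0, 0, 0, 0],
              [1, 1, 1, 1, 1]]
pred_merge = [[1, 1, 1, 1, 1],
              [0, 0, 1, 0, 0],   # a bridge pixel joins both lines
              [1, 1, 1, 1, 1]]

split_counts = count_splits_and_merges(gt, pred_split)
merge_counts = count_splits_and_merges(gt, pred_merge)
```

Both faulty predictions differ from the ground truth by a single pixel, so a purely pixel-level error barely registers them, while the component counts flag one split and one merge respectively. In this toy weighting, raising `alpha` shifts emphasis to such structural errors, and `beta` decides which of the two is penalized more.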