
UnSupDLA: Towards Unsupervised Document Layout Analysis

Talha Uddin Sheikh, Tahira Shehzadi, Khurram Azeem Hashmi, Didier Stricker, Muhammad Zeshan Afzal

TL;DR

This paper addresses the scarcity of labeled data in Document Layout Analysis (DLA) by proposing a vision-based unsupervised pre-training pipeline that starts from unlabeled document images. It generates initial layout masks by applying Normalized Cuts to self-supervised DINO features, which lets it identify multiple objects per image, and then trains a detector with a loss-drop strategy over several unsupervised iterations to refine these masks. Without using any labels, the method reports competitive box and mask metrics on PubLayNet, DocLayNet, and TableBank, with notably high mask AP on TableBank. The approach reduces labeling requirements and enables cross-dataset unsupervised training, offering a scalable pre-training path for DLA across diverse document collections.

Abstract

Document layout analysis is a key area in document research, involving techniques like text mining and visual analysis. Despite various methods developed to tackle layout analysis, a critical but frequently overlooked problem is the scarcity of labeled data needed for analyses. With the rise of internet use, an overwhelming number of documents are now available online, making the process of accurately labeling them for research purposes increasingly challenging and labor-intensive. Moreover, the diversity of documents online presents a unique set of challenges in maintaining the quality and consistency of these labels, further complicating document layout analysis in the digital era. To address this, we employ a vision-based approach for analyzing document layouts designed to train a network without labels. Instead, we focus on pre-training, initially generating simple object masks from the unlabeled document images. These masks are then used to train a detector, enhancing object detection and segmentation performance. The model's effectiveness is further amplified through several unsupervised training iterations, continuously refining its performance. This approach significantly advances document layout analysis, particularly in precision and efficiency, without relying on labels.

Paper Structure

This paper contains 16 sections, 5 equations, 2 figures, and 6 tables.

Figures (2)

  • Figure 1: Overview of our unsupervised training module, which takes unlabeled data and trains models for object detection and instance segmentation. Objects Masking [tokencut_TPAMI23] generates rough object masks from the features of self-supervised DINO [DINO_selfsup3]. To obtain multiple object masks in an unlabeled image, we build a patch-wise similarity matrix and apply Normalized Cuts (Ncut) to it, which first extracts a mask for a single foreground object. Repeating this procedure with an altered affinity matrix each time lets Objects Masking discover multiple object masks in one image, demonstrated here with eight iterations.
  • Figure 2: Comparative visual analysis of unsupervised learning on the PubLayNet dataset. Top: predicted layouts; bottom: corresponding ground-truth layouts. Red arrows mark details that the model detects but human annotators overlooked.
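
The mask-discovery loop described in the Figure 1 caption (build a patch-wise similarity matrix, apply Ncut to extract one foreground mask, alter the affinity matrix, repeat) can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names `ncut_foreground` and `discover_masks`, the clipped cosine-similarity affinity, the rule for choosing which side of the cut is foreground, and the choice to "alter" the affinity matrix by restricting it to patches not yet assigned are all assumptions made for illustration. The features are synthetic stand-ins, not DINO outputs.

```python
import numpy as np


def ncut_foreground(W):
    """Run one Normalized Cut on affinity matrix W; return a boolean foreground mask."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    lap = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(lap)      # eigenvalues are returned in ascending order
    y = d_inv_sqrt * vecs[:, 1]        # second-smallest eigenvector of (D - W) y = lambda D y
    side = y > y.mean()                # bipartition the patches at the mean value
    seed = np.argmax(np.abs(y))        # assumption: the most extreme patch is foreground
    return side if side[seed] else ~side


def discover_masks(feats, n_masks, eps=1e-5):
    """Extract up to n_masks disjoint patch masks from (N, D) patch features."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    affinity = np.clip(f @ f.T, eps, None)   # assumption: clipped cosine similarity
    free = np.ones(len(f), dtype=bool)       # patches not yet claimed by a mask
    masks = []
    for _ in range(n_masks):
        idx = np.flatnonzero(free)
        if len(idx) < 2:
            break
        # Assumption: the affinity matrix is altered by dropping claimed patches.
        fg = ncut_foreground(affinity[np.ix_(idx, idx)])
        mask = np.zeros(len(f), dtype=bool)
        mask[idx[fg]] = True
        masks.append(mask)
        free &= ~mask
    return masks


# Toy input: 8 "background" patches and two 4-patch "objects" (synthetic features).
background = np.tile([1.0, 0.0, 0.0], (8, 1))
object_b = np.tile([0.0, 1.0, 0.0], (4, 1))
object_c = np.tile([0.3, 0.0, np.sqrt(1 - 0.3**2)], (4, 1))
feats = np.vstack([background, object_b, object_c])

masks = discover_masks(feats, n_masks=2)
for m in masks:
    print(np.flatnonzero(m).tolist())
```

On this toy input the first mask covers patches 8 to 11 and the second covers patches 12 to 15, so each pass recovers one object group. The `n_masks` argument plays the role of the iteration count mentioned in the caption. In a real pipeline the patch masks would still need to be reshaped to the patch grid and upsampled to image resolution, which this sketch leaves out.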