DocSAM: Unified Document Image Segmentation via Query Decomposition and Heterogeneous Mixed Learning

Xiao-Hui Li, Fei Yin, Cheng-Lin Liu

TL;DR

DocSAM is a transformer-based unified framework for document image segmentation (DIS) that jointly handles layout analysis, multi-granularity text detection, and table structure recognition by decomposing each task into instance and semantic queries. Semantic queries are obtained by embedding each dataset's class names with Sentence-BERT, while learnable instance queries interact with them through a Hybrid Query Decoder to produce semantic masks, instance masks, class labels, and bounding boxes, all trained with a combination of four losses. Trained on a heterogeneous mix of about 50 DIS datasets, DocSAM demonstrates strong generalization and efficiency, and ablations show the benefits of curriculum learning and instance query selection; some tasks remain challenging for single-modal models. The approach is practical for large-scale deployment and serves as a robust pre-trained model for downstream document understanding, with extensions to multi-modal DIS systems left as future work.

Abstract

Document image segmentation is crucial for document analysis and recognition but remains challenging due to the diversity of document formats and segmentation tasks. Existing methods often address these tasks separately, resulting in limited generalization and resource wastage. This paper introduces DocSAM, a transformer-based unified framework designed for various document image segmentation tasks, such as document layout analysis, multi-granularity text segmentation, and table structure recognition, by modelling these tasks as a combination of instance and semantic segmentation. Specifically, DocSAM employs Sentence-BERT to map category names from each dataset into semantic queries that match the dimensionality of instance queries. These two sets of queries interact through an attention mechanism and are cross-attended with image features to predict instance and semantic segmentation masks. Instance categories are predicted by computing the dot product between instance and semantic queries, followed by softmax normalization of scores. Consequently, DocSAM can be jointly trained on heterogeneous datasets, enhancing robustness and generalization while reducing computational and storage resources. Comprehensive evaluations show that DocSAM surpasses existing methods in accuracy, efficiency, and adaptability, highlighting its potential for advancing document image understanding and segmentation across various applications. Code is available at https://github.com/xhli-git/DocSAM.
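The classification step described in the abstract (dot product between instance and semantic queries, then softmax) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `softmax` and `classify_instances`, the 4-dimensional toy vectors, and the absence of any scaling or projection layers are assumptions made purely for illustration. In DocSAM the semantic queries would come from Sentence-BERT embeddings of class names; here they are hand-written placeholders.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_instances(instance_queries, semantic_queries):
    """Return, for each instance query, a probability per semantic query.

    Scores are plain dot products, normalized with softmax across classes.
    (Hypothetical helper; any temperature or projection is omitted.)
    """
    all_probs = []
    for inst in instance_queries:
        scores = [sum(a * b for a, b in zip(inst, sem))
                  for sem in semantic_queries]
        all_probs.append(softmax(scores))
    return all_probs

# Toy stand-ins for two class-name embeddings (e.g. "text" and "table").
semantic_queries = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]

# Three toy instance queries with the same dimensionality.
instance_queries = [
    [2.0, 0.0, 0.0, 0.0],  # aligned with class 0
    [0.0, 3.0, 0.0, 0.0],  # aligned with class 1
    [1.0, 1.0, 0.0, 0.0],  # equally similar to both classes
]

class_probs = classify_instances(instance_queries, semantic_queries)
predicted = [max(range(len(p)), key=p.__getitem__) for p in class_probs]
```

Because the class set enters only through the semantic queries, the same scoring code works for datasets with different label vocabularies, which is what makes joint training on heterogeneous datasets possible in this formulation.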

Paper Structure

This paper contains 28 sections, 8 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Examples of various segmentation tasks on heterogeneous document datasets.
  • Figure 2: Network structure of the proposed DocSAM. DocSAM unifies various document image segmentation tasks into a single model through instance and semantic query decomposition and interaction. Skip connections and norm layers are omitted for simplicity.
  • Figure 3: Loss curves and on-the-fly validation during training.
  • Figure 4: Qualitative results on public document layout analysis benchmarks produced by our DocSAM model.
  • Figure 5: Qualitative results on public ancient and handwritten document segmentation benchmarks produced by our DocSAM model.
  • ...and 3 more figures