DLAFormer: An End-to-End Transformer For Document Layout Analysis

Jiawei Wang, Kai Hu, Qiang Huo

TL;DR

DLAFormer is an end-to-end transformer framework for document layout analysis (DLA) that unifies graphical object detection, text region detection, logical role classification, and reading order prediction in a single relation-prediction model. It introduces a unified label space and type-aware, dynamic content queries (type-wise queries) within a Deformable DETR-style detector, so that multiple DLA sub-tasks are learned concurrently, and it applies a coarse-to-fine strategy to handle diverse graphical page objects. The approach achieves state-of-the-art or competitive results on Comp-HRDoc and DocLayNet, with less error propagation and better efficiency than multi-branch or multi-stage architectures. The work highlights the potential of relation prediction as a scalable, extensible paradigm for document understanding and leaves additional DLA tasks and text embeddings to future work.

Abstract

Document layout analysis (DLA) is crucial for understanding the physical layout and logical structure of documents, serving information retrieval, document summarization, knowledge extraction, etc. However, previous studies have typically used separate models to address individual sub-tasks within DLA, including table/figure detection, text region detection, logical role classification, and reading order prediction. In this work, we propose an end-to-end transformer-based approach for document layout analysis, called DLAFormer, which integrates all these sub-tasks into a single model. To achieve this, we treat various DLA sub-tasks (such as text region detection, logical role classification, and reading order prediction) as relation prediction problems and consolidate these relation prediction labels into a unified label space, allowing a unified relation prediction module to handle multiple tasks concurrently. Additionally, we introduce a novel set of type-wise queries to enhance the physical meaning of content queries in DETR. Moreover, we adopt a coarse-to-fine strategy to accurately identify graphical page objects. Experimental results demonstrate that our proposed DLAFormer outperforms previous approaches that employ multi-branch or multi-stage architectures for multiple tasks on two document layout analysis benchmarks, DocLayNet and Comp-HRDoc.

Paper Structure (24 sections, 2 equations, 3 figures, 6 tables)

Figures (3)

  • Figure 1: An example of our problem definition for document layout analysis. Purple arrow: intra-region relationship; green arrow: inter-region relationship; orange arrow: logical role relationship. Best viewed in color.
  • Figure 2: Overall architecture of our DLAFormer for document layout analysis.
  • Figure 3: The unified label space in DLAFormer. $T_i$ denotes Text-line queries, $G_i$ denotes Graphical object queries, and $L_i$ denotes Logical role queries. The purple grids illustrate intra-region relationships, the green grids represent inter-region relationships, and the orange grids signify logical role relationships.
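The unified label space described in Figure 3 can be pictured as one relation matrix over all queries. The following is a minimal sketch of that idea only, not the authors' implementation: the node names (`T0`, `G0`, `L0`), the relation ids, the direction of each link, and the role meanings attached to `L0` to `L2` are all assumptions made for illustration. In the actual model these relations are predicted by a learned module; here they are written by hand to show how three kinds of relations can share one label space.

```python
from enum import IntEnum


class Rel(IntEnum):
    """Relation types sharing one label space (ids are illustrative)."""
    NONE = 0
    INTRA_REGION = 1   # text line -> another text line in the same region
    INTER_REGION = 2   # region -> logically related region
    LOGICAL_ROLE = 3   # text line or graphical object -> logical role query


def build_label_matrix(num_text, num_graphical, num_roles, relations):
    """Return (names, matrix); matrix[i][j] is the relation from node i to node j.

    relations is a list of (source_name, target_name, Rel) triples.
    """
    names = (
        [f"T{i}" for i in range(num_text)]
        + [f"G{i}" for i in range(num_graphical)]
        + [f"L{i}" for i in range(num_roles)]
    )
    index = {name: k for k, name in enumerate(names)}
    matrix = [[Rel.NONE] * len(names) for _ in names]
    for source, target, rel in relations:
        matrix[index[source]][index[target]] = rel
    return names, matrix


def group_text_lines(names, matrix):
    """Merge text lines connected by intra-region links into text regions.

    Assumes intra-region links only connect text-line nodes.
    """
    text_names = [n for n in names if n.startswith("T")]
    parent = {n: n for n in text_names}

    def find(n):
        # Follow parent pointers up to the representative of n's group.
        while parent[n] != n:
            n = parent[n]
        return n

    for i, source in enumerate(names):
        for j, target in enumerate(names):
            if matrix[i][j] == Rel.INTRA_REGION:
                parent[find(source)] = find(target)

    regions = {}
    for n in text_names:
        regions.setdefault(find(n), []).append(n)
    return sorted(regions.values())


# Hypothetical page: a two-line paragraph, a figure, and a one-line caption.
relations = [
    ("T0", "T1", Rel.INTRA_REGION),   # T0 and T1 belong to one paragraph
    ("G0", "T2", Rel.INTER_REGION),   # figure G0 relates to its caption T2
    ("T0", "L0", Rel.LOGICAL_ROLE),   # L0: hypothetical "paragraph" role
    ("T1", "L0", Rel.LOGICAL_ROLE),
    ("T2", "L1", Rel.LOGICAL_ROLE),   # L1: hypothetical "caption" role
    ("G0", "L2", Rel.LOGICAL_ROLE),   # L2: hypothetical "figure" role
]
names, matrix = build_label_matrix(3, 1, 3, relations)
regions = group_text_lines(names, matrix)
print(regions)
```

Because every relation lives in the same matrix, a single prediction head could in principle score all cells at once, which is the intuition behind handling the sub-tasks concurrently. Grouping text lines by their intra-region links then recovers text regions without a separate detection branch.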