Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis

Jiawei Wang, Kai Hu, Zhuoyao Zhong, Lei Sun, Qiang Huo

TL;DR

This work tackles hierarchical document structure analysis (HDSA) by proposing Detect-Order-Construct, a tree-construction framework that casts page object detection, reading order prediction, and table of contents (TOC) based hierarchy reconstruction as relation-prediction tasks. The Detect module combines top-down and bottom-up detection with multi-modal text and vision features; the Order module predicts inter-region reading order using structure-aware transformers; the Construct module derives a hierarchical TOC via a tree-aware relation head and a tree insertion decoding strategy. A new benchmark, Comp-HRDoc, built on HRDoc-derived data, evaluates these subtasks simultaneously, and the end-to-end system reports state-of-the-art performance on PubLayNet, DocLayNet, HRDoc, and Comp-HRDoc. The framework targets end-to-end document understanding across diverse layouts and leaves room for extension to broader document types and graph-based hierarchies.

Abstract

Document structure analysis (aka document layout analysis) is crucial for understanding the physical layout and logical structure of documents, with applications in information retrieval, document summarization, knowledge extraction, etc. In this paper, we concentrate on Hierarchical Document Structure Analysis (HDSA) to explore hierarchical relationships within structured documents created using authoring software employing hierarchical schemas, such as LaTeX, Microsoft Word, and HTML. To comprehensively analyze hierarchical document structures, we propose a tree construction based approach that addresses multiple subtasks concurrently, including page object detection (Detect), reading order prediction of identified objects (Order), and the construction of intended hierarchical structure (Construct). We present an effective end-to-end solution based on this framework to demonstrate its performance. To assess our approach, we develop a comprehensive benchmark called Comp-HRDoc, which evaluates the above subtasks simultaneously. Our end-to-end system achieves state-of-the-art performance on two large-scale document layout analysis datasets (PubLayNet and DocLayNet), a high-quality hierarchical document structure reconstruction dataset (HRDoc), and our Comp-HRDoc benchmark. The Comp-HRDoc benchmark will be released to facilitate further research in this field.
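
To make the three-stage idea concrete, the following is a minimal sketch of how the outputs of Detect, Order, and Construct could be combined into a single tree. It is not the paper's tree insertion decoding; the names `Node`, `build_tree`, `outline`, and `toc_parent` are hypothetical, the model outputs are replaced by hand-written stand-ins, and attaching each non-heading object to the most recent section heading is an assumption made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One page object; `role` uses single letters such as S (section heading) or P (paragraph)."""
    node_id: int
    role: str
    text: str
    children: List["Node"] = field(default_factory=list)


def build_tree(nodes: List[Node], order: List[int],
               toc_parent: Dict[int, Optional[int]]) -> Node:
    """Assemble a document tree from three hypothetical inputs.

    nodes:      detected objects (stand-in for Detect output)
    order:      node ids in reading order (stand-in for Order output)
    toc_parent: heading id -> parent heading id, None for top level
                (stand-in for Construct output)
    """
    by_id = {n.node_id: n for n in nodes}
    root = Node(-1, "ROOT", "document")
    current = root  # most recent section heading seen so far
    for nid in order:
        node = by_id[nid]
        if node.role == "S":
            # Headings missing from toc_parent are treated as top level.
            parent_id = toc_parent.get(nid)
            parent = root if parent_id is None else by_id[parent_id]
            parent.children.append(node)
            current = node
        else:
            # Assumption: body objects hang under the latest heading.
            current.children.append(node)
    return root


def outline(node: Node, depth: int = 0) -> List[str]:
    """Return the section headings as an indented outline."""
    lines: List[str] = []
    for child in node.children:
        if child.role == "S":
            lines.append("  " * depth + child.text)
            lines.extend(outline(child, depth + 1))
    return lines


# Hand-written stand-ins for the three module outputs.
nodes = [
    Node(0, "S", "1 Introduction"),
    Node(1, "P", "Intro paragraph"),
    Node(2, "S", "2 Method"),
    Node(3, "S", "2.1 Detect"),
    Node(4, "P", "Detect paragraph"),
    Node(5, "S", "2.2 Order"),
]
order = [0, 1, 2, 3, 4, 5]
toc_parent = {0: None, 2: None, 3: 2, 5: 2}

tree = build_tree(nodes, order, toc_parent)
print("\n".join(outline(tree)))
```

In this toy input, the two subsections are nested under "2 Method" because of the parent relations alone; the reading order only decides sibling order and where paragraphs attach.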

Paper Structure

This paper contains 27 sections, 14 equations, 9 figures, 9 tables, 1 algorithm.

Figures (9)

  • Figure 1: Overview of our tree construction based approach, named Detect-Order-Construct, for hierarchical document structure analysis.
  • Figure 2: Hierarchical structure reconstruction of a document by integrating the Reading Order and Table of Contents. Blue arrows demonstrate the Text Region Reading Order Relationship, green arrows show the Graphical Region Relationship, and red arrows signify the TOC Relationship. The nodes "P", "S", "C", "T" and "F" represent Paragraph, Section heading, Caption, Table and Footnote, respectively.
  • Figure 3: The overall architecture of our Detect module.
  • Figure 4: A schematic view of the proposed bottom-up text region detection model.
  • Figure 5: Illustration of (a) Multi-modal Feature Enhancement Module; (b) Logical Role Classification Head; (c) Reading Order Relation Prediction Head in bottom-up text region detection model.
  • ...and 4 more figures