Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis
Jiawei Wang, Kai Hu, Zhuoyao Zhong, Lei Sun, Qiang Huo
TL;DR
This work tackles hierarchical document structure analysis (HDSA) by proposing Detect-Order-Construct, a tree-construction framework that unifies page object detection, reading order prediction, and TOC-based hierarchy reconstruction as relation-prediction tasks. The Detect module performs hybrid top-down and bottom-up detection with multi-modal text and vision features; the Order module predicts inter-region reading order using structure-aware transformers; the Construct module derives a hierarchical TOC via a tree-aware relation head and a tree insertion decoding strategy. A new benchmark, Comp-HRDoc, built on HRDoc-derived data, evaluates these subtasks jointly, and extensive experiments show state-of-the-art performance on PubLayNet, DocLayNet, HRDoc, and Comp-HRDoc. The framework offers a basis for end-to-end document understanding across diverse layouts and for future extensions to broader document types and graph-based hierarchies.
Abstract
Document structure analysis (aka document layout analysis) is crucial for understanding the physical layout and logical structure of documents, with applications in information retrieval, document summarization, knowledge extraction, etc. In this paper, we concentrate on Hierarchical Document Structure Analysis (HDSA) to explore hierarchical relationships within structured documents created using authoring software employing hierarchical schemas, such as LaTeX, Microsoft Word, and HTML. To comprehensively analyze hierarchical document structures, we propose a tree construction based approach that addresses multiple subtasks concurrently, including page object detection (Detect), reading order prediction of identified objects (Order), and the construction of intended hierarchical structure (Construct). We present an effective end-to-end solution based on this framework to demonstrate its performance. To assess our approach, we develop a comprehensive benchmark called Comp-HRDoc, which evaluates the above subtasks simultaneously. Our end-to-end system achieves state-of-the-art performance on two large-scale document layout analysis datasets (PubLayNet and DocLayNet), a high-quality hierarchical document structure reconstruction dataset (HRDoc), and our Comp-HRDoc benchmark. The Comp-HRDoc benchmark will be released to facilitate further research in this field.
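To make the three-stage idea concrete, the following is a minimal sketch of the final tree-assembly step only, not the authors' implementation. It assumes that detection has already produced a list of page objects, that reading order is given as a permutation of their indices, and that each object has one predicted parent index (with -1 meaning the document root). The names `Node`, `build_tree`, and `outline` are hypothetical and are introduced purely for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One page object (heading, paragraph, ...) in the document tree."""
    label: str
    children: list[Node] = field(default_factory=list)


def build_tree(labels: list[str], order: list[int], parent: list[int]) -> Node:
    """Assemble a tree from per-object parent predictions.

    labels[i] -- text of detected object i (assumed output of a Detect step)
    order     -- object indices sorted by reading order (assumed Order step)
    parent[i] -- index of the predicted parent of object i, or -1 for the root
                 (assumed output of a Construct-style relation predictor)
    """
    root = Node("ROOT")
    nodes = {i: Node(labels[i]) for i in order}
    # Attaching in reading order keeps siblings in reading order as well.
    for i in order:
        target = root if parent[i] == -1 else nodes[parent[i]]
        target.children.append(nodes[i])
    return root


def outline(node: Node, depth: int = 0) -> list[tuple[int, str]]:
    """Flatten the tree depth-first into (depth, label) pairs, skipping the root."""
    rows = [] if depth == 0 else [(depth, node.label)]
    for child in node.children:
        rows.extend(outline(child, depth + 1))
    return rows
```

In this sketch the detection order and the reading order are allowed to differ, which is why `order` is passed separately. A real system would also have to handle invalid predictions such as cycles or missing parents, which this example does not attempt.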
