HybriDLA: Hybrid Generation for Document Layout Analysis

Yufan Chen, Omar Moured, Ruiping Liu, Junwei Zheng, Kunyu Peng, Jiaming Zhang, Rainer Stiefelhagen

TL;DR

HybriDLA tackles highly variable document layouts by unifying diffusion-based refinement (DR) with autoregressive query expansion (AQE) in a single, end-to-end framework. A multi-scale Feature Fusion Encoder provides rich, hierarchical visual context, while the Hybrid Generative Decoder performs coarse-to-fine layout generation through AQE followed by iterative DR. Empirical results on DocLayNet and M$^6$Doc show state-of-the-art performance for vision-only document layout analysis, with mAP reaching 83.5% on DocLayNet with an InternImage backbone and 71.4% on M$^6$Doc, demonstrating strong generality across backbones. The approach narrows the gap to multimodal methods and offers a flexible, backbone-agnostic solution for complex page layouts, though future work should integrate textual and metadata cues to further boost accuracy and efficiency.

Abstract

Conventional document layout analysis (DLA) depends on empirical priors or a fixed set of learnable queries executed in a single forward pass. While sufficient for early-generation documents with a small, predetermined number of regions, this paradigm struggles with contemporary documents, which exhibit diverse element counts and increasingly complex layouts. To address the challenges posed by modern documents, we present HybriDLA, a novel generative framework that unifies diffusion and autoregressive decoding within a single layer. The diffusion component iteratively refines bounding-box hypotheses, whereas the autoregressive component injects semantic and contextual awareness, enabling precise region prediction even in highly varied layouts. To further enhance detection quality, we design a multi-scale feature-fusion encoder that captures both fine-grained and high-level visual cues. This architecture raises performance to 83.5% mean Average Precision (mAP) on DocLayNet. Extensive experiments on the DocLayNet and M$^6$Doc benchmarks demonstrate that HybriDLA achieves state-of-the-art performance, outperforming previous approaches. All data and models will be made publicly available at https://yufanchen96.github.io/projects/HybriDLA.

Paper Structure

This paper contains 22 sections, 4 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: (a) Progressive prediction behavior of representative paradigms and our proposed HybriDLA, shown as successive intermediate outputs obtained during the forward pass, from left to right. (b) On the DocLayNet [pfitzmann2022doclaynet] dataset, we compare the performance (mAP in %) of the best models from five different DLA methods, i.e., the traditional region-based method, the DETR-based method, the diffusion-based method, the autoregressive method, and our proposed HybriDLA method.
  • Figure 2: Overview of the HybriDLA architecture. The framework consists of a feature fusion encoder and a hybrid generative decoder. The encoder aggregates multi-scale visual features via convolutional and transformer layers, producing a layout-aware representation. The decoder operates through two mechanisms: it first performs autoregressive query expansion to propose hierarchical layout regions, then applies a diffusion-style refinement with residual correction to denoise and adjust spatial predictions. Auxiliary queries and intermediate supervision facilitate convergence. This coarse-to-fine pipeline enables precise and adaptive generation of layout detection results.
  • Figure 3: Results of HybriDLA with InternImage [wang2023internimage] on DocLayNet [pfitzmann2022doclaynet]. Each subfigure shows a document page with ground-truth annotations on the left and the results of the model on the right.
  • Figure 4: Typical failure cases of HybriDLA with InternImage [wang2023internimage] on DocLayNet [pfitzmann2022doclaynet]. For each page, the left half shows ground-truth annotations, and the right half shows prediction results.
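
The coarse-to-fine decoding described in the Figure 2 caption (query expansion followed by iterative refinement with residual correction) can be illustrated with a toy NumPy sketch. This is not the authors' implementation: every function name, the box parameterization, the step size, and the stage counts are assumptions made for illustration. In particular, the residual here comes from an oracle that points at the nearest target box, standing in for what would be a learned denoising network.

```python
import numpy as np


def nearest_target_residual(boxes, targets):
    """Toy oracle: residual from each box to its nearest target box.
    (Assumption: a real model would predict this with a network.)"""
    dists = np.linalg.norm(boxes[:, None, :] - targets[None, :, :], axis=-1)
    return targets[dists.argmin(axis=1)] - boxes


def mean_error(boxes, targets):
    """Mean distance from each box to its nearest target box."""
    dists = np.linalg.norm(boxes[:, None, :] - targets[None, :, :], axis=-1)
    return float(dists.min(axis=1).mean())


def expand_queries(boxes, rng, num_new=2, spread=0.2):
    """Hypothetical stand-in for query expansion: append new box queries
    sampled around the mean of the existing ones."""
    center = boxes.mean(axis=0, keepdims=True)
    new = center + spread * rng.standard_normal((num_new, boxes.shape[1]))
    return np.clip(np.concatenate([boxes, new], axis=0), 0.0, 1.0)


def refine(boxes, targets, steps=4, step_size=0.5):
    """Iterative refinement: each step adds a scaled residual correction."""
    for _ in range(steps):
        residual = nearest_target_residual(boxes, targets)
        boxes = np.clip(boxes + step_size * residual, 0.0, 1.0)
    return boxes


def hybrid_decode(targets, stages=3, init_queries=2, seed=0):
    """Alternate query expansion and refinement for a few stages.
    Returns the final boxes and (error_before, error_after) per stage."""
    rng = np.random.default_rng(seed)
    boxes = rng.uniform(0.0, 1.0, size=(init_queries, 4))
    history = []
    for _ in range(stages):
        boxes = expand_queries(boxes, rng)
        before = mean_error(boxes, targets)
        boxes = refine(boxes, targets)
        history.append((before, mean_error(boxes, targets)))
    return boxes, history


# Three made-up target boxes as normalized 4-vectors in [0, 1].
targets = np.array([
    [0.10, 0.10, 0.40, 0.30],
    [0.50, 0.20, 0.90, 0.60],
    [0.10, 0.70, 0.90, 0.95],
])
boxes, history = hybrid_decode(targets)
```

In this sketch the number of box queries grows with each stage instead of being fixed up front, and each refinement stage shrinks the distance to the nearest target. That mirrors, in a very simplified form, the contrast the abstract draws with a fixed set of learnable queries executed in a single forward pass.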