HybriDLA: Hybrid Generation for Document Layout Analysis

Yufan Chen; Omar Moured; Ruiping Liu; Junwei Zheng; Kunyu Peng; Jiaming Zhang; Rainer Stiefelhagen

TL;DR

HybriDLA tackles highly variable document layouts by unifying diffusion-based refinement (DR) with autoregressive query expansion (AQE) in a single, end-to-end framework. A multi-scale Feature Fusion Encoder provides rich, hierarchical visual context, while the Hybrid Generative Decoder performs coarse-to-fine layout generation through AQE and iterative DR. Empirical results on DocLayNet and M$^6$Doc show state-of-the-art performance for vision-only document layout analysis, with mAP reaching 83.5% on DocLayNet with an InternImage backbone and 71.4% on M$^6$Doc, demonstrating strong generality across backbones. The approach narrows the gap to multimodal methods and offers a flexible, backbone-agnostic solution for complex page layouts. Future work should integrate textual and metadata cues to further improve accuracy and efficiency.

Abstract

Conventional document layout analysis (DLA) depends on empirical priors or a fixed set of learnable queries executed in a single forward pass. While sufficient for early-generation documents with a small, predetermined number of regions, this paradigm struggles with contemporary documents, which exhibit diverse element counts and increasingly complex layouts. To address the challenges posed by modern documents, we present HybriDLA, a novel generative framework that unifies diffusion and autoregressive decoding within a single layer. The diffusion component iteratively refines bounding-box hypotheses, whereas the autoregressive component injects semantic and contextual awareness, enabling precise region prediction even in highly varied layouts. To further enhance detection quality, we design a multi-scale feature-fusion encoder that captures both fine-grained and high-level visual cues. This architecture raises performance to 83.5% mean Average Precision (mAP) on DocLayNet. Extensive experiments on the DocLayNet and M$^6$Doc benchmarks demonstrate that HybriDLA sets a new state of the art, outperforming previous approaches. All data and models will be made publicly available at https://yufanchen96.github.io/projects/HybriDLA.
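
The mAP figures quoted above are detection metrics that rest on how well predicted boxes overlap ground-truth boxes. As a minimal sketch of that underlying quantity, the snippet below computes intersection-over-union (IoU) for two axis-aligned boxes. The function name `box_iou` and the `(x_min, y_min, x_max, y_max)` box format are illustrative assumptions and are not taken from the HybriDLA code; benchmark mAP is typically obtained by matching predictions to ground truth at one or more IoU thresholds and averaging precision over classes, which this sketch does not attempt.

```python
from typing import Sequence


def box_iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection-over-union of two axis-aligned boxes.

    Boxes are assumed to be (x_min, y_min, x_max, y_max); this format is
    an assumption made for illustration only.
    """
    # Corners of the overlapping rectangle (may be empty).
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp at zero so disjoint boxes yield zero intersection area.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) inputs.
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0.0, 0.0, 2.0, 2.0)
    ground_truth = (1.0, 1.0, 3.0, 3.0)
    print(f"IoU = {box_iou(predicted, ground_truth):.4f}")
```

A predicted region counts as a true positive only when its IoU with a same-class ground-truth region meets the chosen threshold, which is why iteratively refining box coordinates can translate directly into higher mAP.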
