OmniDocLayout: Towards Diverse Document Layout Generation via Coarse-to-Fine LLM Learning

Hengrui Kang, Zhuangcheng Gu, Zhiyuan Zhao, Zichen Wen, Bin Wang, Weijia Li, Conghui He

TL;DR

The paper tackles diverse document layout generation by introducing OmniDocLayout-1M, the first million-sample dataset spanning six real-world document types, and OmniDocLayout-LLM, a 0.5B model trained under a two-stage coarse-to-fine paradigm. Stage 1 learns universal layout priors from OmniDocLayout-1M with coarse labels, while Stage 2 adapts to a target domain using a small set of fine-grained annotations via a label mapping $\phi$, enabling effective domain-specific generation with limited supervision. Experiments on M$^6$Doc show state-of-the-art performance against both layout experts and general-purpose LLMs, with strong gains in mIoU and alignment, and a human evaluation indicates that the generated layouts are perceptually on par with human-designed ones. The work provides a scalable, adaptable framework and demonstrates the practicality of LLM-based layout generation for diverse, long-sequence document layouts, while highlighting the need for better metrics for complex structures.
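To make the role of the label mapping $\phi$ concrete, here is a minimal sketch of a coarse-to-fine category table. The category names are hypothetical and do not come from the paper, and representing $\phi$ as a one-to-many lookup (one coarse label refined into several domain labels) is an assumption made for illustration only.

```python
from typing import Dict, Tuple

# Hypothetical coarse -> fine label table standing in for phi.
# The paper's actual category sets are not listed in this summary.
PHI: Dict[str, Tuple[str, ...]] = {
    "text": ("paragraph", "lead", "byline"),
    "title": ("headline", "subhead"),
    "figure": ("photo", "illustration"),
}

# Inverse lookup (fine -> coarse), built once from the table above.
FINE_TO_COARSE: Dict[str, str] = {
    fine: coarse for coarse, fines in PHI.items() for fine in fines
}


def fine_candidates(coarse: str) -> Tuple[str, ...]:
    """Return the fine-grained labels that a coarse label may be refined into."""
    return PHI[coarse]


def to_coarse(fine: str) -> str:
    """Collapse a fine-grained domain label back to its coarse category."""
    return FINE_TO_COARSE[fine]
```

Under this assumption, Stage 1 data would only ever carry the keys of `PHI`, while the small Stage 2 set would carry the values, and `to_coarse` gives a quick consistency check between the two label spaces.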

Abstract

Document AI has advanced rapidly and is attracting increasing attention. Yet while most efforts have focused on document layout analysis (DLA), its generative counterpart, document layout generation, remains underexplored. Unlike traditional graphic layout design and room layout planning, document layout generation typically involves more elements per page and exhibits greater structural diversity and complexity. A major obstacle is the scarcity of diverse document layouts: academic papers with Manhattan-style structures dominate existing studies, while open-world genres such as newspapers and magazines remain severely underrepresented. To address this gap, we curate OmniDocLayout-1M, the first million-scale dataset of diverse document layouts, covering six common document types and comprising contemporary layouts collected from multiple sources. Moreover, since existing methods struggle in complex domains and often fail to arrange long sequences coherently, we introduce OmniDocLayout-LLM, a 0.5B model with a two-stage Coarse-to-Fine learning paradigm: 1) learning universal layout principles from our dataset with coarse category definitions, and 2) transferring that knowledge to a specific domain with a few fine-grained annotated samples. Extensive experiments demonstrate that our approach achieves strong performance on multiple domains of the M$^6$Doc dataset, substantially surpassing both existing layout generation experts and several recent general-purpose LLMs. Our code, dataset, and models will be publicly released.

Paper Structure

This paper contains 28 sections, 2 equations, 19 figures, 5 tables.

Figures (19)

  • Figure 1: Overview of OmniDocLayout. (Top & Middle) show the curation process and examples of OmniDocLayout-1M. (Bottom) illustrates diverse layouts unconditionally generated by our OmniDocLayout-LLM.
  • Figure 2: Statistical Analysis of OmniDocLayout-1M. (a) & (b) show the multi-dimensional diversity, (c) proves its consistency with prior knowledge.
  • Figure 3: Overview of our layout generation framework (OmniDocLayout-LLM). Left: The unified layout prompt consists of a Base Prompt (document metadata), a Condition Prompt for U-Cond, C$\rightarrow$S+P, C+S$\rightarrow$P, Completion, and Refinement, and a Task Prompt defining the layout objective. Right: A Coarse-to-Fine Mapping $\phi : \mathcal{C}_{\text{coar}} \rightarrow \mathcal{C}_{\text{fine}}$ transfers coarse layout categories into fine-grained domain-specific labels.
  • Figure 4: Visualization Examples of Various Methods with U-Cond Task. For general-purpose LLMs, we adopt the strongest 5-shot setting.
  • Figure 5: Scenario Scope Comparison. (Left) shows a generated room layout by nauata2020house. (Middle) shows a generated graphic layout by Hsu_2023_CVPR. (Right) shows a generated document layout by our OmniDocLayout-LLM.
  • ...and 14 more figures
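
The three-part prompt described in the Figure 3 caption (Base Prompt, Condition Prompt, Task Prompt) can be sketched roughly as follows. This is an illustrative sketch, not the authors' prompt format: the wording of each part, the `Element` type, the `(x, y, w, h)` box convention, and the reading of the task names (category to size and position, category plus size to position) are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

# Task names taken from the Figure 3 caption, written with ASCII arrows.
TASKS = ("U-Cond", "C->S+P", "C+S->P", "Completion", "Refinement")


@dataclass
class Element:
    """Hypothetical layout element: a category plus an optional (x, y, w, h) box."""
    category: str
    box: Optional[Tuple[int, int, int, int]] = None


def condition_prompt(task: str, elements: List[Element]) -> str:
    """Serialize only the information the task is allowed to see (assumed semantics)."""
    if task == "U-Cond":
        return ""  # unconditional generation: no condition given
    if task == "C->S+P":
        return "; ".join(e.category for e in elements)  # categories only
    if any(e.box is None for e in elements):
        raise ValueError(f"task {task!r} needs a box for every element")
    if task == "C+S->P":
        # categories plus sizes (width x height), positions withheld
        return "; ".join(f"{e.category} {e.box[2]}x{e.box[3]}" for e in elements)
    # Completion / Refinement: partial or noisy elements with full boxes
    return "; ".join(f"{e.category} {e.box}" for e in elements)


def build_prompt(doc_type: str, task: str, elements: Sequence[Element] = ()) -> str:
    """Join Base, Condition, and Task parts in the order the caption lists them."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    base = f"Document type: {doc_type}"  # Base Prompt: document metadata
    cond = condition_prompt(task, list(elements))
    task_line = f"Task: {task}"  # Task Prompt: layout objective
    return "\n".join(part for part in (base, cond, task_line) if part)
```

For example, `build_prompt("newspaper", "C->S+P", [Element("title"), Element("text")])` yields a base line, a category-only condition line, and a task line, while `U-Cond` omits the condition line entirely.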