Relation-Rich Visual Document Generator for Visual Information Extraction

Zi-Han Jiang, Chien-Wei Lin, Wei-Hua Li, Hsuan-Tung Liu, Yi-Ren Yeh, Chu-Song Chen

TL;DR

RIDGE addresses data scarcity and layout diversity in visual information extraction (VIE) from relation-rich documents with a two-stage synthetic data generator. It first uses LLMs to generate structured document content in a Hierarchical Structure Text (HST) format, then learns content-driven layouts with a Content-driven Layout Generation Model (CLGM) trained solely on OCR results through self-supervised layout learning, which yields diverse, realistic document images without manual annotations. A Hierarchical Structure Learning framework further reinforces understanding of document hierarchies, improving VIE performance and interpretability. Across open-set benchmarks and domain-specific tasks, RIDGE consistently improves the fine-tuning of MLLMs and LayoutLMv3, showing its practical value for robust visual document understanding and scalable data generation.

Abstract

Despite advances in Large Language Models (LLMs) and Multimodal LLMs (MLLMs) for visual document understanding (VDU), visual information extraction (VIE) from relation-rich documents remains challenging due to layout diversity and limited training data. While existing synthetic document generators attempt to address data scarcity, they either rely on manually designed layouts and templates or adopt rule-based approaches that limit layout diversity. Moreover, current layout generation methods focus solely on topological patterns without considering textual content, making them impractical for generating documents with complex associations between content and layout. In this paper, we propose a Relation-rIch visual Document GEnerator (RIDGE) that addresses these limitations through a two-stage approach: (1) Content Generation, which leverages LLMs to generate document content using a carefully designed Hierarchical Structure Text format that captures entity categories and relationships, and (2) Content-driven Layout Generation, which learns to create diverse, plausible document layouts solely from easily available Optical Character Recognition (OCR) results, requiring no human labeling or annotation effort. Experimental results demonstrate that our method significantly enhances the performance of document understanding models on various VIE benchmarks. The code and model will be available at https://github.com/AI-Application-and-Integration-Lab/RIDGE .
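
To make the idea of content that "captures entity categories and relationships" concrete, the following is a minimal sketch of one way such a hierarchy could be represented and flattened into entities and links. It is not the paper's actual Hierarchical Structure Text format, which is not specified here. The `Entity` class, the `flatten` function, and the sample form content are all hypothetical; the category names (header, key, value) are borrowed from the legend of Figure 1.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Assumed category set, borrowed from the Figure 1 legend (header, key, value).
CATEGORIES = ("header", "key", "value")


@dataclass
class Entity:
    """Hypothetical node in a hierarchical document-content tree."""
    text: str
    category: str
    children: List["Entity"] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")


def flatten(root: Entity) -> Tuple[List[Tuple[str, str]], List[Tuple[int, int]]]:
    """Walk the tree depth-first and return (entities, links).

    entities: list of (text, category) in visiting order.
    links: list of (parent_index, child_index) pairs, i.e. entity linking.
    """
    entities: List[Tuple[str, str]] = []
    links: List[Tuple[int, int]] = []

    def visit(node: Entity, parent_idx: Optional[int]) -> None:
        idx = len(entities)
        entities.append((node.text, node.category))
        if parent_idx is not None:
            links.append((parent_idx, idx))
        for child in node.children:
            visit(child, idx)

    visit(root, None)
    return entities, links


# Made-up sample content: one header with two key-value pairs.
doc = Entity("Applicant Information", "header", [
    Entity("Name", "key", [Entity("Jane Doe", "value")]),
    Entity("Date", "key", [Entity("2024-01-01", "value")]),
])
entities, links = flatten(doc)
```

In this sketch, each link connects a parent entity to a child entity (header to key, key to value), which mirrors the entity-linking lines described in the Figure 1 caption.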

Paper Structure

This paper contains 29 sections, 2 equations, 14 figures, 9 tables.

Figures (14)

  • Figure 1: Overview of RIDGE, including (a) Content Generation and (b) Content-driven Layout Generation. In the visualized annotation, purple represents the header, red represents the key, blue represents the value, and green lines represent entity linking.
  • Figure 2: Content-driven Layout Generation Model (CLGM).
  • Figure 3: Hierarchical Structure Learning.
  • Figure 4: Example of generated documents. (a) General form-like images. (b) Right: SROIE-styled image; Left: real SROIE image. (c) Bottom: EPHOIE-styled image; Top: real EPHOIE image.
  • Figure 5: Interpretability brought by VIE with Hierarchical Structure Parsing.
  • ...and 9 more figures