PromptDLA: A Domain-aware Prompt Document Layout Analysis Framework with Descriptive Knowledge as a Cue

Zirui Zhang, Yaping Zhang, Lu Xiang, Yang Zhao, Feifei Zhai, Yu Zhou, Chengqing Zong

TL;DR

This paper introduces PromptDLA, a domain-aware prompter for Document Layout Analysis (DLA) that uses descriptive knowledge as cues to integrate domain priors into DLA, customizing its prompts to the specific attributes of each data domain.

Abstract

Document Layout Analysis (DLA) is crucial for document artificial intelligence and has recently received increasing attention, resulting in an influx of large-scale public DLA datasets. Existing work often combines data from various domains in these datasets to improve the generalization of DLA models. However, directly merging the datasets for training often yields suboptimal performance, because it overlooks the different layout structures inherent to each domain. These variations include different labeling styles, document types, and languages. This paper introduces PromptDLA, a domain-aware prompter for Document Layout Analysis that leverages descriptive knowledge as cues to integrate domain priors into DLA. PromptDLA features a domain-aware prompter that customizes prompts based on the specific attributes of the data domain. These prompts then serve as cues that direct the DLA model toward critical features and structures within the data, enhancing its ability to generalize across varied domains. Extensive experiments show that our proposal achieves state-of-the-art performance on DocLayNet, PubLayNet, M6Doc, and D$^4$LA. Our code is available at https://github.com/Zirui00/PromptDLA.

Paper Structure (21 sections, 5 equations, 10 figures, 17 tables, 1 algorithm)


Figures (10)

  • Figure 1: Examples of domain differences: (a) different document types (financial report, manual, patent), which cause variations in layout structure and element distribution; (b) different languages; and (c) inconsistent labeling styles. Note that the "text" and "list" items in DocLayNet are labeled as smaller individual units, while they are integrated as a whole in DocBank.
  • Figure 2: Overview of the PromptDLA method for domain-aware layout prediction. A Domain-Aware Prompter encodes domain information into a prompt vector, which is prepended to the sequence of image patch embeddings. This augmented input is processed by a vision backbone. Multi-scale features are extracted from the backbone and refined by an FPN before being passed to a detection head for final layout prediction.
  • Figure 3: Comparison with Pre-training Paradigms in DLA
  • Figure 4: Framework of Prompt Generator.
  • Figure 5: Framework of Fusion Layer.
  • ...and 5 more figures
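
The Figure 2 caption describes a prompt vector being prepended to the sequence of image patch embeddings before the backbone runs. The following is a minimal sketch of that single step only, not the authors' implementation. The names `TOY_PROMPTS`, `encode_domain`, and `prepend_prompt` are hypothetical, the fixed lookup table is an assumed stand-in for the Domain-Aware Prompter, and the embedding width of 4 is a toy value. The backbone, FPN, and detection head are omitted.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical stand-in for the Domain-Aware Prompter: a fixed table of
# toy prompt vectors keyed by domain name (assumption, not from the paper).
TOY_PROMPTS: Dict[str, Vector] = {
    "financial_report": [1.0, 0.0, 0.0, 0.0],
    "manual": [0.0, 1.0, 0.0, 0.0],
    "patent": [0.0, 0.0, 1.0, 0.0],
}


def encode_domain(domain: str) -> Vector:
    """Return a copy of the toy prompt vector for a known domain."""
    if domain not in TOY_PROMPTS:
        raise ValueError(f"unknown domain: {domain!r}")
    return list(TOY_PROMPTS[domain])


def prepend_prompt(prompt: Vector, patches: List[Vector]) -> List[Vector]:
    """Place the prompt vector in front of the patch embedding sequence."""
    for patch in patches:
        if len(patch) != len(prompt):
            raise ValueError("prompt and patch embeddings must share a width")
    # Copy every vector so the caller's inputs are never mutated.
    return [list(prompt)] + [list(patch) for patch in patches]


# Three toy patch embeddings standing in for an embedded page image.
patches = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]
tokens = prepend_prompt(encode_domain("patent"), patches)
print(len(tokens))  # 4: one prompt token followed by three patch tokens
```

In this sketch the backbone would receive `tokens` instead of `patches`, so the domain cue is available at every layer that attends over the sequence.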