Generative Compositor for Few-Shot Visual Information Extraction

Zhibo Yang, Wei Hua, Sibo Song, Cong Yao, Yingying Zhu, Wenqing Cheng, Xiang Bai

TL;DR

The paper tackles Visual Information Extraction (VIE) in scenarios with very limited labeled data by introducing the Generative Compositor (GC), an OCR-dependent generative model that retrieves words from source OCR blocks and assembles them into structured outputs guided by prompts. GC combines a LayoutLMv3-based Source Encoder, a Prompt-Aware Resampler, and a Generative Matcher to produce context-aware, dynamic classification weights and a matching score matrix for BIO-style labeling, with a sequential GMseq variant to handle irregular reading orders. To enable strong few-shot performance, the authors propose three pre-training tasks (Match to Fill, Search One Direction, and Search All Directions) that inject spatial and reading-order priors and leverage prompt information during decoding. Empirically, GC achieves state-of-the-art results in full-shot VIE on FUNSD, competitive results on CORD and POIE, and significant improvements in 1-/5-/10-shot settings, especially when OCR reading order is disrupted; ablations confirm the contributions of pre-training and the prompt-aware resampler. The work demonstrates that a hybrid generative–matching approach with targeted pre-training can substantially improve few-shot VIE and offers practical benefits for real-world document understanding under OCR error and layout variability.

Abstract

Visual Information Extraction (VIE), aiming at extracting structured information from visually rich document images, plays a pivotal role in document processing. Considering various layouts, semantic scopes, and languages, VIE encompasses an extensive range of types, potentially numbering in the thousands. However, many of these types suffer from a lack of training data, which poses significant challenges. In this paper, we propose a novel generative model, named Generative Compositor, to address the challenge of few-shot VIE. The Generative Compositor is a hybrid pointer-generator network that emulates the operations of a compositor by retrieving words from the source text and assembling them based on the provided prompts. Furthermore, three pre-training strategies are employed to enhance the model's perception of spatial context information. In addition, a prompt-aware resampler is specially designed to enable efficient matching by leveraging the entity-semantic prior contained in prompts. The introduction of the prompt-based retrieval mechanism and the pre-training strategies enables the model to acquire more effective spatial and semantic clues with limited training samples. Experiments demonstrate that the proposed method achieves highly competitive results in full-sample training and notably outperforms the baseline in the 1-shot, 5-shot, and 10-shot settings.
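
To make the "retrieve and assemble" idea concrete, here is a minimal sketch of pointer-style matching in plain Python. It is not the authors' implementation. The function names (`matching_scores`, `compose`), the toy one-hot vectors, the dot-product similarity, and the greedy argmax selection are all assumptions made for illustration; in the paper the source vectors come from a LayoutLMv3-based encoder and the matcher vectors are generated from the prompt.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def matching_scores(matcher_vecs: Sequence[Vector],
                    source_vecs: Sequence[Vector]) -> List[List[float]]:
    """Score matrix: one row per decoding step, one column per source word.

    Dot-product similarity is an assumption chosen for this sketch.
    """
    return [[sum(m * s for m, s in zip(mv, sv)) for sv in source_vecs]
            for mv in matcher_vecs]


def compose(words: Sequence[str],
            source_vecs: Sequence[Vector],
            matcher_vecs: Sequence[Vector]) -> Tuple[List[str], List[int]]:
    """Greedily pick the best-matching source word at each decoding step."""
    scores = matching_scores(matcher_vecs, source_vecs)
    indices = [max(range(len(row)), key=row.__getitem__) for row in scores]
    return [words[i] for i in indices], indices


# Hypothetical OCR words with toy one-hot "embeddings" (illustrative only).
words = ["Total", "Amount", "$", "42.00"]
source_vecs = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
# Hypothetical matcher vectors for a prompt asking for the total value.
matcher_vecs = [
    [0.1, 0.0, 0.8, 0.1],
    [0.0, 0.1, 0.2, 0.9],
]

answer, indices = compose(words, source_vecs, matcher_vecs)
print(answer, indices)
```

Because every output word is selected by index from the OCR source, each answer token carries its own grounding (the index of the source word), which is the property the abstract attributes to the pointer-style design.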

Paper Structure

This paper contains 19 sections, 3 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Number of categories under different sample size ranges. The statistics show that there are 3,822 categories with sample sizes less than 5, while no more than 3 categories have sample sizes higher than 500.
  • Figure 2: Few-shot comparisons on CORD. Our method outperforms LayoutLMv3 and GeoLayoutLM in all four few-shot settings, with a significant advantage in the 1-shot setting.
  • Figure 3: Showcase of categories with sample sizes less than 5 in Figure 1.
  • Figure 4: A schematic illustration of the proposed Generative Compositor. Given a document image and a prompt, the Generative Matcher produces answers with grounding information, by calculating the similarity between the source vector encoded by the Source Encoder and the matcher vector generated by the Target Generator.
  • Figure 5: A diagram of Prompt-Aware Resampler.
  • ...and 6 more figures