Generative Compositor for Few-Shot Visual Information Extraction
Zhibo Yang, Wei Hua, Sibo Song, Cong Yao, Yingying Zhu, Wenqing Cheng, Xiang Bai
TL;DR
The paper tackles Visual Information Extraction (VIE) in scenarios with very limited labeled data by introducing the Generative Compositor (GC), an OCR-dependent generative model that retrieves words from source OCR blocks and assembles them into structured outputs guided by prompts. GC combines a LayoutLMv3-based Source Encoder, a Prompt-Aware Resampler, and a Generative Matcher to produce context-aware, dynamic classification weights and a matching score matrix for BIO-style labeling, with a sequential GMseq variant to handle irregular reading orders. To enable strong few-shot performance, the authors propose three pre-training tasks (Match to Fill, Search One Direction, and Search All Directions) that inject spatial and reading-order priors and leverage prompt information during decoding. Empirically, GC achieves state-of-the-art results in full-shot VIE on FUNSD, competitive results on CORD and POIE, and significant improvements in 1-/5-/10-shot settings, especially when OCR reading order is disrupted; ablations confirm the contributions of pre-training and the prompt-aware resampler. The work demonstrates that a hybrid generative–matching approach with targeted pre-training can substantially improve few-shot VIE and offers practical benefits for real-world document understanding under OCR errors and layout variability.
Abstract
Visual Information Extraction (VIE), which aims to extract structured information from visually rich document images, plays a pivotal role in document processing. Considering various layouts, semantic scopes, and languages, VIE encompasses an extensive range of types, potentially numbering in the thousands. However, many of these types suffer from a lack of training data, which poses significant challenges. In this paper, we propose a novel generative model, named Generative Compositor, to address the challenge of few-shot VIE. The Generative Compositor is a hybrid pointer-generator network that emulates the operations of a compositor by retrieving words from the source text and assembling them based on the provided prompts. Furthermore, three pre-training strategies are employed to enhance the model's perception of spatial context information. In addition, a prompt-aware resampler is specially designed to enable efficient matching by leveraging the entity-semantic prior contained in prompts. The introduction of the prompt-based retrieval mechanism and the pre-training strategies enables the model to acquire more effective spatial and semantic clues with limited training samples. Experiments demonstrate that the proposed method achieves highly competitive results in full-sample training, while notably outperforming the baseline in the 1-shot, 5-shot, and 10-shot settings.
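The retrieve-and-assemble idea can be illustrated with a toy pointer-style decoder. This is only a minimal sketch under stated assumptions, not the authors' architecture: the function names (`match_scores`, `compose`), the hand-made one-hot word vectors, the query vectors, and the `<stop>` token are all hypothetical, standing in for the learned encoder, prompt-conditioned queries, and matching module. It shows how a score matrix between decoding steps and OCR words can be used to copy source words into an output field.

```python
from math import exp

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def match_scores(query_vecs, word_vecs):
    """Score matrix: one row per decoding step, one column per source word."""
    return [softmax([dot(q, w) for w in word_vecs]) for q in query_vecs]

def compose(words, score_matrix, stop_index):
    """Greedy pointer decoding: copy the best-scoring source word at each step."""
    out = []
    for row in score_matrix:
        best = max(range(len(row)), key=row.__getitem__)
        if best == stop_index:  # hypothetical stop token ends the field
            break
        out.append(words[best])
    return out

# Toy OCR words; the last entry is an assumed stop token, not part of the paper.
words = ["Invoice", "Date:", "2021-03-04", "Total", "<stop>"]

# Assumption: one-hot vectors stand in for learned word representations.
word_vecs = [[1.0 if i == j else 0.0 for j in range(len(words))]
             for i in range(len(words))]

# Assumption: hand-written queries standing in for prompt-conditioned decoder
# states for a "date" field. Step 0 points at word 2, step 1 at the stop token.
queries = [
    [0.0, 0.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 5.0],
]

scores = match_scores(queries, word_vecs)
extracted = compose(words, scores, stop_index=4)
print(extracted)  # ['2021-03-04']
```

Because the output is built only by pointing at source words, the decoder cannot produce text that is absent from the OCR result, which is the property a copy mechanism trades against free-form generation.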
