SAIL: Sample-Centric In-Context Learning for Document Information Extraction

Jinyu Zhang, Zhiyuan You, Jize Wang, Xinyi Le

TL;DR

SAIL tackles document information extraction on Visually Rich Documents under a training-free regime by introducing a sample-centric ICL strategy that builds per-sample prompts from layout similarity, entity-level text similarity, and document-level similarity. It unifies a prompt template that guides LLMs through explicit layout-text analysis and diverse exemplars, achieving state-of-the-art performance among training-free approaches and approaching fully supervised methods on FUNSD, CORD, and SROIE across GPT-3.5, GPT-4o, and ChatGLM3. Key contributions include defining layout and entity-level similarities, a per-sample adaptive prompt construction, and comprehensive ablations confirming the effectiveness of each component. The approach yields strong generalization for DIE in VRDs and highlights the practical value of adaptive, sample-centric prompts in training-free information extraction pipelines.

Abstract

Document Information Extraction (DIE) aims to extract structured information from Visually Rich Documents (VRDs). Previous full-training approaches have demonstrated strong performance but may struggle with generalization to unseen data. In contrast, training-free methods leverage powerful pre-trained models like Large Language Models (LLMs) to address various downstream tasks with only a few examples. Nonetheless, training-free methods for DIE encounter two primary challenges: (1) understanding the complex relationship between layout and textual elements in VRDs, and (2) providing accurate guidance to pre-trained models. To address these challenges, we propose Sample-centric In-context Learning (SAIL) for DIE. SAIL introduces a fine-grained entity-level textual similarity to facilitate in-depth text analysis by LLMs and incorporates layout similarity to enhance the analysis of layouts in VRDs. Additionally, SAIL formulates a unified In-Context Learning (ICL) prompt template for various sample-centric examples, enabling tailored prompts that deliver precise guidance to pre-trained models for each sample. Extensive experiments on the FUNSD, CORD, and SROIE benchmarks with various base models (e.g., LLMs) indicate that our method outperforms training-free baselines and comes close to full-training methods. The results demonstrate the effectiveness and generalization ability of our method.

Paper Structure

This paper contains 21 sections, 9 equations, 17 figures, 13 tables.

Figures (17)

  • Figure 1: For the (a) test sample from the CORD dataset, our SAIL selects (b) layout similarity examples (grey marked), entity-level similarity examples (yellow marked), and document-level similarity examples (orange marked) to construct ICL prompts. (c) Benefiting from these examples, SAIL precisely extracts all information, while even the powerful GPT-4o misidentifies three entities and incorrectly labels three entities.
  • Figure 2: Illustration of the SAIL framework, including extracting texts $T$ and boxes $B$ from document images, encoding them separately, selecting textually similar entities, layout-similar documents, and textually similar documents for each test sample, constructing sample-centric prompts using diverse examples, and generating predicted labels.
  • Figure 3: Illustration of layout similarity evaluation, including drawing boxes onto a blank image, cropping and resizing to form a layout image, and comparing layout images.
  • Figure 4: Case study on performance comparison of (a) ICL-D3IE and (b) our SAIL. ICL-D3IE wrongly predicts the three green boxes on the left as "answer". In contrast, our proposed SAIL correctly predicts them as "question".
  • Figure 5: Ablation study of layout analysis. "w/o LA" means without adding layout analysis. Adding layout analysis achieves higher F1 scores across all three datasets.
  • ...and 12 more figures
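
The layout similarity steps named in the Figure 3 caption (draw boxes on a blank image, crop and resize to a layout image, compare layout images) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `layout_image` and `layout_similarity`, the 32×32 target size, the nearest-neighbour resize, and the mean-absolute-difference comparison are all assumptions chosen for brevity, since the captions do not specify them. Boxes are assumed to be integer pixel coordinates `(x0, y0, x1, y1)`, with at least one non-empty box per document.

```python
import numpy as np


def layout_image(boxes, page_w, page_h, size=32):
    """Turn a list of integer boxes (x0, y0, x1, y1) into a size x size layout image."""
    # Step 1: draw every box as a filled rectangle on a blank canvas.
    canvas = np.zeros((page_h, page_w), dtype=np.float32)
    for x0, y0, x1, y1 in boxes:
        canvas[y0:y1, x0:x1] = 1.0

    # Step 2: crop to the tightest region that still contains all boxes.
    rows = np.flatnonzero(canvas.any(axis=1))
    cols = np.flatnonzero(canvas.any(axis=0))
    crop = canvas[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

    # Step 3: resize to a fixed size (nearest-neighbour sampling, an assumption).
    row_idx = np.arange(size) * crop.shape[0] // size
    col_idx = np.arange(size) * crop.shape[1] // size
    return crop[np.ix_(row_idx, col_idx)]


def layout_similarity(img_a, img_b):
    """Score in [0, 1]; 1.0 means identical layout images (metric is an assumption)."""
    return 1.0 - float(np.abs(img_a - img_b).mean())


if __name__ == "__main__":
    # Hypothetical documents: two stacked lines, the same lines shifted, two columns.
    doc_a = layout_image([(10, 10, 50, 20), (10, 30, 50, 40)], 100, 100)
    doc_b = layout_image([(30, 40, 70, 50), (30, 60, 70, 70)], 100, 100)
    doc_c = layout_image([(10, 10, 20, 50), (40, 10, 50, 50)], 100, 100)
    print(layout_similarity(doc_a, doc_b), layout_similarity(doc_a, doc_c))
```

Because of the crop step, a layout that is merely translated on the page maps to the same layout image, so `doc_a` and `doc_b` compare as identical while the two-column `doc_c` scores lower. Ranking candidate documents by this score is one plausible way to pick layout-similar examples for a test sample.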