ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction

Jiabang He, Lei Wang, Yi Hu, Ning Liu, Hui Liu, Xing Xu, Heng Tao Shen

TL;DR

The paper tackles document information extraction (DIE) from visually rich documents by reframing it as an in-context learning problem for large language models. It introduces ICL-D3IE, a framework that constructs three kinds of diverse demonstrations (hard, layout-aware, and formatting) and iteratively updates them, using nearest-neighbor training documents to guide inference. Without any fine-tuning, the approach yields state-of-the-art or competitive results on FUNSD and SROIE in the in-distribution setting and remains robust in the out-of-distribution setting across FUNSD, CORD, and SROIE. Overall, it shows the potential of LLM-based in-context learning for visually rich document understanding and highlights the importance of prompt design and demonstration structure.
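The nearest-neighbor step mentioned above can be illustrated with a minimal sketch. It assumes documents are compared by cosine similarity over bag-of-words counts, which is a stand-in chosen only to keep the example self-contained; the function names and the toy texts are hypothetical and are not taken from the paper or its code.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words counts (a simple stand-in for a learned text embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_training_docs(test_text, train_texts, k=2):
    """Return indices of the k training documents most similar to the test text."""
    query = bow(test_text)
    ranked = sorted(
        range(len(train_texts)),
        key=lambda i: cosine(query, bow(train_texts[i])),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy corpus: each string stands for the OCR text of one training document.
train_texts = [
    "total amount cash change receipt",
    "name date signature fax number",
    "menu item quantity price subtotal",
]
neighbors = nearest_training_docs("receipt total cash", train_texts, k=1)
print(neighbors)
```

In practice a sentence-embedding model would likely replace `bow`; the ranking logic stays the same.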

Abstract

Large language models (LLMs), such as GPT-3 and ChatGPT, have demonstrated remarkable results in various natural language processing (NLP) tasks with in-context learning, which involves inference based on a few demonstration examples. Despite their successes in NLP tasks, no investigation has been conducted to assess the ability of LLMs to perform document information extraction (DIE) using in-context learning. Applying LLMs to DIE poses two challenges: the modality and task gap. To this end, we propose a simple but effective in-context learning framework called ICL-D3IE, which enables LLMs to perform DIE with different types of demonstration examples. Specifically, we extract the most difficult and distinct segments from hard training documents as hard demonstrations for benefiting all test instances. We design demonstrations describing relationships that enable LLMs to understand positional relationships. We introduce formatting demonstrations for easy answer extraction. Additionally, the framework improves diverse demonstrations by updating them iteratively. Our experiments on three widely used benchmark datasets demonstrate that the ICL-D3IE framework enables Davinci-003/ChatGPT to achieve superior performance when compared to previous pre-trained methods fine-tuned with full training in both the in-distribution (ID) setting and in the out-of-distribution (OOD) setting. Code is available at https://github.com/MAEHCM/ICL-D3IE.

Paper Structure

This paper contains 18 sections, 6 equations, 16 figures, 13 tables.

Figures (16)

  • Figure 1: Two approaches for solving the DIE task: (a) previous pre-trained document understanding models (LayoutLMv3, Huang et al., 2022; LayoutLMv2, Xu et al., 2021) fine-tuned with full training examples, and (b) in-context learning over LLMs with a few examples.
  • Figure 2: A detailed illustration of the ICL-D3IE framework, including obtaining nearest neighbor documents for test samples from the training dataset, constructing iteratively updated diverse demonstrations, and performing inference.
  • Figure 3: Example of the input and output of in-context learning with diverse demonstrations. The text highlighted in blue is not processed by LLMs, while the text highlighted in red is fed into LLMs. The green-highlighted text represents the output of LLMs. The text in red represents the prediction made by the LLM. The final prompt comprises label mapping, hard demonstrations, layout-aware demonstrations, formatting demonstrations, and a question prompt of "What are the labels for these texts?".
  • Figure 4: Further analysis on (a) the effect of the number of different demonstrations on CORD and (b) the effect of the number of updates applied to the hard demonstrations.
  • Figure 5: Further analysis on (a) the performance effect of arranging demonstrations in a different order and (b) the performance comparison of increasing the number of demonstrations on ICL-D3IE (Davinci-003/ChatGPT) and LayoutLMv3 on CORD.
  • ...and 11 more figures
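
The prompt composition described in the Figure 3 caption (label mapping, hard demonstrations, layout-aware demonstrations, formatting demonstrations, and the question prompt) can be sketched as plain string assembly. This is an illustrative sketch only: the function name, the placement of the test segments just before the question, and all example strings are assumptions, not the authors' actual prompt.

```python
QUESTION = "What are the labels for these texts?"  # question prompt quoted in the Figure 3 caption


def build_prompt(label_mapping, hard_demos, layout_demos, formatting_demos, test_segments):
    """Concatenate the prompt parts in the order listed in the Figure 3 caption.

    Placing the test segments directly before the question is an assumption
    made for this sketch.
    """
    parts = [label_mapping, *hard_demos, *layout_demos, *formatting_demos]
    parts.append("\n".join(test_segments))
    parts.append(QUESTION)
    return "\n\n".join(parts)


# Hypothetical inputs, invented for illustration.
prompt = build_prompt(
    label_mapping="Labels: Q = question, A = answer, H = header, O = other",
    hard_demos=['{text: "DATE:", box: [10, 20, 60, 35]} -> Q'],
    layout_demos=['"DATE:" is to the left of "04/12/98".'],
    formatting_demos=['Answer format: {text: "...", box: [...], entity: "..."}'],
    test_segments=['{text: "TOTAL", box: [12, 300, 70, 315]}'],
)
print(prompt)
```

Keeping each demonstration type in its own list makes it easy to update one type (for example, the hard demonstrations) between iterations without touching the others.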