DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models

Sungnyun Kim, Haofu Liao, Srikar Appalaraju, Peng Tang, Zhuowen Tu, Ravi Kumar Satzoda, R. Manmatha, Vijay Mahadevan, Stefano Soatto

TL;DR

This study presents DocKD, a framework that enriches the data generation process by integrating external document knowledge. It produces high-quality document annotations and surpasses the direct knowledge distillation approach, which does not leverage external document knowledge.

Abstract

Visual document understanding (VDU) is a challenging task that involves understanding documents across various modalities (text and image) and layouts (forms, tables, etc.). This study aims to enhance the generalizability of small VDU models by distilling knowledge from LLMs. We find that directly prompting LLMs often fails to generate informative and useful data. In response, we present a new framework, DocKD, that enriches the data generation process by integrating external document knowledge. Specifically, we provide an LLM with various document elements, such as key-value pairs, layouts, and descriptions, to elicit open-ended answers. Our experiments show that DocKD produces high-quality document annotations and surpasses the direct knowledge distillation approach, which does not leverage external document knowledge. Moreover, student VDU models trained solely on DocKD-generated data are not only comparable to those trained on human-annotated data on in-domain tasks but also significantly outperform them on out-of-domain tasks.
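
The idea of supplying an LLM with document elements can be illustrated with a minimal sketch. It is not the authors' implementation: the `DocumentElements` container, the `build_generation_prompt` helper, and the section labels in the prompt are all hypothetical names chosen for illustration, and the actual prompt templates used by DocKD may differ.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DocumentElements:
    """Hypothetical container for the document knowledge named in the abstract."""
    text: str                                                  # text extracted from the document
    key_values: Dict[str, str] = field(default_factory=dict)   # key-value pairs
    layout: str = ""                                           # layout information as text
    description: str = ""                                      # free-form document description


def build_generation_prompt(doc: DocumentElements, instruction: str) -> str:
    """Assemble a generation prompt; elements with no content are skipped."""
    parts = [f"Document text:\n{doc.text}"]
    if doc.key_values:
        kv_lines = "\n".join(f"- {k}: {v}" for k, v in doc.key_values.items())
        parts.append(f"Key-value pairs:\n{kv_lines}")
    if doc.layout:
        parts.append(f"Layout:\n{doc.layout}")
    if doc.description:
        parts.append(f"Description:\n{doc.description}")
    # The instruction goes last so the LLM reads the document context first.
    parts.append(instruction)
    return "\n\n".join(parts)
```

A prompt built this way would then be sent to the teacher LLM, whose answers become training data for the student model. With only the document text populated, the helper reduces to the direct-prompting baseline that the abstract compares against.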

Paper Structure

This paper contains 61 sections, 4 equations, 14 figures, 19 tables.

Figures (14)

  • Figure 1: We leverage an LLM to generate document annotations given the text extracted from a document image.
  • Figure 2: Overview of DocKD. (a) To prepare training data, we provide an LLM teacher with a generation prompt $\mathbf{p}_\text{gen}$ given the document text. The LLM generates answers $\mathbf{a}_\text{gen}$, which are then converted into ($\mathbf{p}_\text{task}, \mathbf{a}_\text{task}$). We explore methods to inject external document knowledge ($\dashrightarrow$) into the document text or $\mathbf{p}_\text{gen}$ to obtain high-quality annotations. (b) We train a student VDU model using the generated task prompt and answer pairs ($\mathbf{p}_\text{task}, \mathbf{a}_\text{task}$).
  • Figure 3: (a) When the input document text is in its raw OCR form, the LLM produces QA pairs that are simply extracted from the text. (b) When provided with linearized OCR text processed by a linearization model, the LLM generates QA pairs that require visual layout knowledge to solve.
  • Figure 4: The templates on the left serve as input prompts to the LLM for (a) generating non-KV entities and (b) naming KV entities. For (b), in iteration $n$, the $n$-th KV entity is provided as input along with the output from the previous iteration. On the right, we show the generated entities and field names, with blue boxes representing non-KV entities and red boxes representing KV entities.
  • Figure 5: Top-10 most frequently generated document class labels from IDL.
  • ...and 9 more figures