InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions

Ryota Tanaka, Taichi Iki, Kyosuke Nishida, Kuniko Saito, Jun Suzuki

TL;DR

This paper tackles zero-shot generalization in visual document understanding (VDU) by introducing InstructDoc, a large-scale instruction-tuning dataset spanning 12 VDU tasks from 30 public datasets, structured with a unified instruction schema. It also presents InstructDr, a model that connects document features to a language model through a bridging module, the Document-former, enabling multimodal, instruction-guided reasoning across multi-page documents. Empirical results show InstructDr achieves state-of-the-art zero-shot performance among multimodal LLMs and surpasses ChatGPT on several VDU benchmarks, with performance that stays robust across different instruction wordings and strong transfer during task-specific fine-tuning. The work demonstrates the potential of instruction-driven generalization for diverse document types and tasks, while acknowledging OCR noise and instruction diversity as key future directions for improvement.

Abstract

We study the problem of completing various visual document understanding (VDU) tasks, e.g., question answering and information extraction, on real-world documents through human-written instructions. To this end, we propose InstructDoc, the first large-scale collection of 30 publicly available VDU datasets, each with diverse instructions in a unified format, which covers a wide range of 12 tasks and includes open document types/formats. Furthermore, to enhance the generalization performance on VDU tasks, we design a new instruction-based document reading and understanding model, InstructDr, that connects document images, image encoders, and large language models (LLMs) through a trainable bridging module. Experiments demonstrate that InstructDr can effectively adapt to new VDU datasets, tasks, and domains via given instructions and outperforms existing multimodal LLMs and ChatGPT without specific training.
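
As an illustration of what a "unified format" for instructions could look like, the following is a minimal sketch in Python. It assumes each example carries an intent, a query with optional answer options, and an optional answer style, as the Figure 1 caption describes. The class name `InstructionExample`, the function `build_prompt`, the field names, and the sample text are all hypothetical and are not taken from the InstructDoc release.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InstructionExample:
    """One hypothetical record in a unified instruction format."""
    intent: str   # what the task asks the model to do
    query: str    # question or field name taken from the original dataset
    answer: str   # target output taken from the original dataset
    options: List[str] = field(default_factory=list)  # candidate answers, if any
    answer_style: Optional[str] = None  # optional constraint on the output form


def build_prompt(example: InstructionExample) -> str:
    """Join the instruction parts into a single input string."""
    parts = [example.intent, f"Question: {example.query}"]
    if example.options:
        parts.append("Options: " + ", ".join(example.options))
    if example.answer_style:
        parts.append(example.answer_style)
    return "\n".join(parts)


# Made-up sample record, for illustration only.
qa = InstructionExample(
    intent="Read the document and answer the question.",
    query="What is the invoice date?",
    answer="2021-03-04",
    answer_style="Answer with a short phrase taken from the document.",
)
print(build_prompt(qa))
```

Keeping the intent and answer style separate from the query makes it possible to pair several differently worded instructions with the same underlying dataset example.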

Paper Structure (59 sections, 9 figures, 18 tables)

Figures (9)

  • Figure 1: Examples of InstructDoc dataset. The input defines the intent (highlighted in blue), the query and options (green), and the answer style (magenta). The query and options and the outputs are from the original datasets. We annotated instructions composed of intent and answer style, or of intent only.
  • Figure 2: Datasets used in InstructDoc. InstructDoc covers a wide range of VDU tasks and document types and formats.
  • Figure 3: InstructDr model. We update only the parameters of Document-former and the projection FFN layer during training.
  • Figure 4: Comparison of zero-shot performance on DUDE for five different instructions. w/o Multiple instructions denotes our model trained with a single instruction per dataset.
  • Figure 5: Model performance as a function of the number of task clusters used in training. ($\cdot$) denotes the number of tasks.
  • ...and 4 more figures