InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions
Ryota Tanaka, Taichi Iki, Kyosuke Nishida, Kuniko Saito, Jun Suzuki
TL;DR
This paper tackles zero-shot generalization in visual document understanding (VDU) by introducing InstructDoc, a large-scale instruction-tuning dataset that spans 12 VDU tasks drawn from 30 public datasets and uses a unified instruction schema. It also presents InstructDr, a model that connects document features to a language model through a bridging module, the Document-former, enabling multimodal, instruction-guided reasoning over multi-page documents. Empirical results show that InstructDr achieves state-of-the-art zero-shot performance among multimodal LLMs and surpasses ChatGPT on several VDU benchmarks, performs robustly across instruction variants, and transfers well under task-specific fine-tuning. The work demonstrates the potential of instruction-driven generalization across diverse document types and tasks, and it identifies OCR noise and instruction diversity as key directions for future improvement.
Abstract
We study the problem of completing various visual document understanding (VDU) tasks, e.g., question answering and information extraction, on real-world documents through human-written instructions. To this end, we propose InstructDoc, the first large-scale collection of 30 publicly available VDU datasets, each with diverse instructions in a unified format, which covers a wide range of 12 tasks and includes open document types/formats. Furthermore, to enhance the generalization performance on VDU tasks, we design a new instruction-based document reading and understanding model, InstructDr, that connects document images, image encoders, and large language models (LLMs) through a trainable bridging module. Experiments demonstrate that InstructDr can effectively adapt to new VDU datasets, tasks, and domains via given instructions and outperforms existing multimodal LLMs and ChatGPT without specific training.
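To make the idea of "diverse instructions in a unified format" concrete, the following is a minimal sketch of how examples from different VDU tasks might be cast into one shared (instruction, document, answer) layout. The field names, the `TEMPLATES` wording, and the `to_instruction_example` helper are all hypothetical illustrations; they are not taken from InstructDoc and may differ from its actual schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class InstructionExample:
    """One example in a shared layout (hypothetical, not the dataset's real schema)."""
    instruction: str        # natural-language task description plus the query
    page_images: List[str]  # paths to page images; multi-page documents list several
    ocr_text: List[str]     # OCR tokens extracted from the pages
    answer: str             # target output as free text


# Hypothetical instruction templates, one per task; the wording is illustrative only.
TEMPLATES: Dict[str, str] = {
    "question_answering": "Read the document and answer the question: {query}",
    "information_extraction": "Extract the value of the field '{query}' from the document.",
}


def to_instruction_example(
    task: str,
    query: str,
    page_images: List[str],
    ocr_text: List[str],
    answer: str,
) -> InstructionExample:
    """Convert a task-specific record into the shared instruction layout."""
    if task not in TEMPLATES:
        raise KeyError(f"no template for task: {task}")
    return InstructionExample(
        instruction=TEMPLATES[task].format(query=query),
        page_images=list(page_images),
        ocr_text=list(ocr_text),
        answer=answer,
    )


if __name__ == "__main__":
    example = to_instruction_example(
        task="question_answering",
        query="What is the invoice total?",
        page_images=["invoice_page1.png"],
        ocr_text=["Invoice", "Total", "42.00"],
        answer="42.00",
    )
    print(example.instruction)
```

Under this assumed layout, a model sees every task as the same text-generation problem conditioned on an instruction and a document, which is what lets a new task be specified at inference time by writing a new instruction rather than by adding a task-specific head.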
