Read and Think: An Efficient Step-wise Multimodal Language Model for Document Understanding and Reasoning

Jinxu Zhang

TL;DR

This work tackles document understanding and reasoning in multimodal document images by generating high-quality, step-by-step Q&A data with large multimodal models and a filtering checker, then training a compact 2B Document Assistant (DocAssistant) to perform context-aware extraction and multi-hop reasoning. The approach combines template- and few-shot-based data generation, external OCR and chart tools, and rationales to guide learning, enabling interpretable stepwise reasoning rather than single-word answers. Empirical results on DocVQA, InfographicVQA, and ChartQA demonstrate state-of-the-art performance on complex layouts and reasoning tasks, with ablations confirming the value of extended data and the data-checker. The work highlights synthetic, step-wise data as a practical pathway to enhance document understanding capabilities in real-world, diverse documents and invites further exploration of multi-modal reasoning with compact models.

Abstract

Understanding the contents of multimodal documents is essential to accurately extract relevant evidence and use it for reasoning. Existing document understanding models tend to generate answers with a single word or phrase directly, ignoring the source document's evidence and lacking interpretability. In this work, we address the lack of step-wise capabilities through data augmentation and extension. Specifically, we use Multi-modal Large Language Models (MLLMs), which have strong visual understanding and reasoning abilities, as data generators to produce step-wise question-and-answer pairs for document images, and we use a high-performance LLM as an error detector to filter out noisy data. This step-wise data generation pipeline is implemented using both template-based and few-shot methods. We then use the generated high-quality data to train a humanized document understanding and reasoning model, dubbed DocAssistant, which is specifically designed to solve complex questions that require reasoning or multi-hop question answering. Experimental results demonstrate the effectiveness and application value of step-wise generation, showing a 5% improvement on InfoVQA with complex layouts and a 7% improvement on ChartQA with complex reasoning, compared to directly generated answers. We hope our work highlights the potential of synthetic data and encourages further exploration of multi-modal document reasoning capabilities.
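The generate-then-filter idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `StepwiseQA` record, `build_training_set`, and the toy generator and checker are all hypothetical names, and a plain text string stands in for a document image so the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class StepwiseQA:
    """Hypothetical record for one generated sample."""
    question: str
    rationale: str  # intermediate context or reasoning steps
    answer: str


# Assumed interfaces: in the paper an MLLM plays the generator role and an
# LLM plays the checker role; here both are ordinary Python callables.
Generator = Callable[[str], Iterable[StepwiseQA]]
Checker = Callable[[str, StepwiseQA], bool]


def build_training_set(
    documents: Iterable[str], generate: Generator, is_clean: Checker
) -> List[StepwiseQA]:
    """Keep only generated samples that the checker accepts."""
    kept: List[StepwiseQA] = []
    for doc in documents:
        for sample in generate(doc):
            if is_clean(doc, sample):
                kept.append(sample)
    return kept


def toy_generate(doc: str) -> Iterable[StepwiseQA]:
    """Toy stand-in for the generator: one grounded sample, one noisy one."""
    yield StepwiseQA("What is the total?", "The document states 'Total: 42'.", "42")
    yield StepwiseQA("What is the date?", "The document states 'Date: 1999'.", "1999")


def toy_is_clean(doc: str, sample: StepwiseQA) -> bool:
    """Toy stand-in for the checker: the answer must appear in the document."""
    return sample.answer in doc


dataset = build_training_set(["Invoice Total: 42"], toy_generate, toy_is_clean)
```

In this toy run only the first sample survives, because the second answer does not appear in the document text. A real checker would be far more permissive about paraphrase than a substring test.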

Paper Structure (32 sections, 8 equations, 7 figures, 13 tables)

Figures (7)

  • Figure 1: Existing document visual question answering models tend to generate a word or phrase directly as an answer, ignoring the evidence or reasoning steps of the source. We generate high-quality data with intermediate results using a large-scale multi-modal model by constructing templates and employing a few-shot approach. These augmented and extended data are then used to enhance a small-scale multi-modal model, achieving an efficient and general step-wise document understanding and reasoning model.
  • Figure 2: Data checker based on multi-agent interaction. A 26B MLLM generates answers and rationales with relevant context information for text-extraction documents and chart-reasoning documents, according to the input templates or exemplars. Dotted arrows indicate the extended data, including questions. The data checker uses OCR text for ordinary documents and the equivalent tabular information for charts. First, it checks for extraction errors in the generated data. Second, it checks for errors in the intermediate reasoning steps. If any errors are found, the data is considered unqualified.
  • Figure 3: Model overview. Document images are passed through projection layers, concatenated with a prompt (question) and optional OCR text or a table, and fed into the language model for step-wise generation. Answer generation for the extractive and abstractive types consists of two steps: the first step generates the context relevant to the keywords in the question, and the second step generates the corresponding answer based on that context. For the reasoning type, the number of steps depends on the question type and varies with the complexity of the question, using exemplars as the prompt.
  • Figure 4: Output comparison of DocAssistant and other models on three datasets, with red font representing rationales relevant to the question.
  • Figure 5: Analysis of different types of questions in three test sets.
  • ...and 2 more figures
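
The two-stage check described in the Figure 2 caption (extraction errors first, then errors in intermediate reasoning steps) can be illustrated with a small sketch. Note the substitution: the paper uses an LLM as the checker, whereas this example swaps in simple rules, namely a normalized substring test against the OCR text and a recomputation of arithmetic steps written as `a op b = c`. All function names here are hypothetical.

```python
import re
from typing import Iterable

# Matches arithmetic steps such as "45 + 30 = 75" or "75 / 2 = 37.5".
STEP = re.compile(
    r"(-?\d+(?:\.\d+)?)\s*([+\-*/])\s*(-?\d+(?:\.\d+)?)\s*=\s*(-?\d+(?:\.\d+)?)"
)

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so OCR spacing does not matter."""
    return " ".join(text.lower().split())


def extraction_ok(context: str, source_text: str) -> bool:
    """Stage 1 (rule-based stand-in): the quoted context must occur in the source."""
    return normalize(context) in normalize(source_text)


def reasoning_ok(steps: Iterable[str], tol: float = 1e-6) -> bool:
    """Stage 2 (rule-based stand-in): recompute every 'a op b = c' step."""
    for step in steps:
        for a, op, b, c in STEP.findall(step):
            x, y, z = float(a), float(b), float(c)
            if op == "/" and y == 0:
                return False
            if abs(OPS[op](x, y) - z) > tol:
                return False
    return True


def is_qualified(context: str, steps: Iterable[str], source_text: str) -> bool:
    """A sample is kept only if both stages find no error."""
    return extraction_ok(context, source_text) and reasoning_ok(steps)


ocr_text = "Revenue 2019: 45\nRevenue 2020: 30"
good = is_qualified("revenue 2019:  45", ["45 + 30 = 75", "75 / 2 = 37.5"], ocr_text)
bad = is_qualified("revenue 2019: 45", ["45 + 30 = 85"], ocr_text)
```

Running the extraction check before the reasoning check mirrors the order given in the caption: a sample whose context is not grounded in the source is rejected without inspecting its reasoning at all.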