Efficient End-to-End Visual Document Understanding with Rationale Distillation

Wang Zhu, Alekh Agarwal, Mandar Joshi, Robin Jia, Jesse Thomason, Kristina Toutanova

TL;DR

This work tackles the inefficiency of tool-heavy visual document understanding by teaching a compact image-to-text model to produce short rationales before answering. In Rationale Distillation (RD), rationales generated by OCR tools, LLMs, and larger multimodal models are distilled into a small student model (Pix2Struct, 282M parameters), which is trained with multi-task objectives to predict both rationales and answers. On InfoVQA, DocVQA, and ChartQA, the student outperforms the base model by 4-5% absolute accuracy with only 1% higher computational cost. Key contributions include a data augmentation and filtering pipeline, rationale-based multi-task training (QRA, ASR, QRACI, ALRCI), and a detailed analysis of rationale types, robustness, and efficiency. The results indicate that RD can match or exceed stronger baselines while substantially reducing engineering complexity and computation compared to full tool pipelines, with potential for extension to broader document and multimodal tasks.

Abstract

Understanding visually situated language requires interpreting complex layouts of textual and visual elements. Pre-processing tools, such as optical character recognition (OCR), can map document image inputs to textual tokens, then large language models (LLMs) can reason over text. However, such methods have high computational and engineering complexity. Can small pretrained image-to-text models accurately understand visual documents through similar recognition and reasoning steps instead? We propose Rationale Distillation (RD), which incorporates the outputs of OCR tools, LLMs, and larger multimodal models as intermediate "rationales", and trains a small student model to predict both rationales and answers. On three visual document understanding benchmarks representing infographics, scanned documents, and figures, our Pix2Struct (282M parameters) student model finetuned with RD outperforms the base model by 4-5% absolute accuracy with only 1% higher computational cost.

Paper Structure

This paper contains 63 sections, 5 equations, 8 figures, 11 tables, and 2 algorithms.

Figures (8)

  • Figure 1: We synthesize the abilities to recognize and summarize text, deplot structured plots, and generate programs into one small model that performs efficient rationale-based visual document understanding.
  • Figure 2: For training examples, we first generate the full OCR of each image with Google Cloud OCR. Depending on the dataset, we either use LLM-Summarizer (few-shot prompted PaLM 2-L) to generate text evidence (top), or use LLM-Programmer (also PaLM 2-L) to generate a program based on both the OCR and available structured table source for the image (bottom).
  • Figure 3: We first crop along the longer edge of the image to create multiple smaller square images. We generate rationales using the appropriate subset of tools (OCR, LLM-Summarizer, LLM-Programmer, Plot-to-Table) on these images, then categorize the examples and rationales with Multimodal-Verifier (PaLI-X).
  • Figure 4: We analyze the usefulness of student-generated rationales. The systems are shown in order of increasing engineering complexity. All red bars use a pipeline with Google Cloud OCR during inference. RD trades off between accuracy and complexity.
  • Figure 5: The distribution of the log-likelihood of PaLI-X prediction; $y$-axis is the number of examples. Orange bars show correct predictions and blue bars show the wrong predictions.
  • ...and 3 more figures
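
The cropping step described in the Figure 3 caption (cutting along the longer edge to obtain several smaller square images) can be illustrated with a minimal sketch. The caption does not state how many crops are taken or how much they overlap, so the function name `square_crop_boxes`, the `num_crops` parameter, and the evenly spaced placement below are all assumptions for illustration, not the authors' implementation.

```python
from typing import List, Tuple

# A crop box in pixel coordinates: (left, top, right, bottom).
Box = Tuple[int, int, int, int]


def square_crop_boxes(width: int, height: int, num_crops: int) -> List[Box]:
    """Return square crop boxes spaced evenly along the longer image edge.

    Hypothetical helper: the side of each square equals the shorter edge,
    and the crops slide along the longer edge from one end to the other.
    """
    if width <= 0 or height <= 0 or num_crops <= 0:
        raise ValueError("width, height, and num_crops must be positive")

    side = min(width, height)          # square side = shorter edge
    slack = max(width, height) - side  # room to slide along the longer edge

    if num_crops == 1 or slack == 0:
        # A single crop (centered), or the image is already square.
        offsets = [slack // 2]
    else:
        # Evenly spaced offsets covering the full longer edge (assumption).
        offsets = [round(i * slack / (num_crops - 1)) for i in range(num_crops)]

    boxes: List[Box] = []
    for off in offsets:
        if width >= height:
            boxes.append((off, 0, off + side, side))   # slide horizontally
        else:
            boxes.append((0, off, side, off + side))   # slide vertically
    return boxes


if __name__ == "__main__":
    # A tall 100x300 page split into three square crops.
    for box in square_crop_boxes(100, 300, 3):
        print(box)
```

Each box uses the (left, upper, right, lower) ordering, so with Pillow it could be passed directly to `Image.crop(box)`. In the pipeline the caption describes, the rationale tools would then be run on each crop and the results categorized by the verifier.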