DocVLM: Make Your VLM an Efficient Reader

Mor Shpigel Nacson, Aviad Aberdam, Roy Ganz, Elad Ben Avraham, Alona Golts, Yair Kittenplon, Shai Mazor, Ron Litman

TL;DR

DocVLM addresses efficient document understanding in vision-language models (VLMs) by adding an OCR-based modality and compressing the OCR information into a compact set of 64 learned queries. The base VLM weights stay frozen while the compressed OCR representations are fed into the LLM, which lets the model read well under a restricted token budget instead of relying on high-resolution image inputs. Evaluations with LLaVA-OneVision, InternVL2, and Qwen2-VL show consistent improvements in token-constrained regimes, and the reduced token usage extends to multipage documents, with zero-shot results on DUDE and state-of-the-art performance on MP-DocVQA. An ablation study examines the compression strategy, two-stage training, and multipage encoding choices, positioning DocVLM as a practical, model-agnostic approach to efficient document reading.

Abstract

Vision-Language Models (VLMs) excel in diverse visual tasks but face challenges in document understanding, which requires fine-grained text processing. While typical visual tasks perform well with low-resolution inputs, reading-intensive applications demand high-resolution inputs, resulting in significant computational overhead. Using OCR-extracted text in VLM prompts partially addresses this issue but underperforms compared to its full-resolution counterpart, as it lacks the complete visual context needed for optimal performance. We introduce DocVLM, a method that integrates an OCR-based modality into VLMs to enhance document processing while preserving original weights. Our approach employs an OCR encoder to capture textual content and layout, compressing these into a compact set of learned queries incorporated into the VLM. Comprehensive evaluations across leading VLMs show that DocVLM significantly reduces reliance on high-resolution images for document understanding. In limited-token regimes (448$\times$448), DocVLM with 64 learned queries improves DocVQA results from 56.0% to 86.6% when integrated with InternVL2 and from 84.4% to 91.2% with Qwen2-VL. In LLaVA-OneVision, DocVLM achieves improved results while using 80% fewer image tokens. The reduced token usage allows processing multiple pages effectively, showing impressive zero-shot results on DUDE and state-of-the-art performance on MP-DocVQA, highlighting DocVLM's potential for applications requiring high performance and efficiency.
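
The idea of compressing OCR content into a fixed set of learned queries can be illustrated with a minimal sketch. The code below is an assumption-laden stand-in, not the paper's OCR encoder: it uses plain single-head scaled dot-product attention with no learned projections, random vectors in place of trained queries and real OCR embeddings, and hypothetical names (`compress_with_queries`, `dim`, `short_page`, `long_page`). It shows only the property that matters for the token budget: the output always has 64 vectors, however many OCR tokens the page contains.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress_with_queries(ocr_embeddings, queries):
    """Let each query attend over all OCR token embeddings.

    Returns one vector per query, so the output length depends only on
    the number of queries, not on the number of OCR tokens.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    compressed = []
    for query in queries:
        # Scaled dot-product score between this query and every OCR token.
        scores = [
            scale * sum(q * k for q, k in zip(query, token))
            for token in ocr_embeddings
        ]
        weights = softmax(scores)
        # Attention output: weighted average of the OCR token embeddings.
        pooled = [
            sum(w * token[j] for w, token in zip(weights, ocr_embeddings))
            for j in range(dim)
        ]
        compressed.append(pooled)
    return compressed


rng = random.Random(0)
dim = 16          # illustrative embedding size, not taken from the paper
num_queries = 64  # the query count reported in the abstract

# Random stand-ins: in DocVLM the queries are learned during training.
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]
short_page = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(50)]
long_page = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(800)]

short_out = compress_with_queries(short_page, queries)
long_out = compress_with_queries(long_page, queries)
print(len(short_out), len(long_out))
```

Both pages compress to the same number of vectors, which is why the OCR modality adds a constant overhead to the LLM input regardless of how dense the document text is.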

Paper Structure

This paper contains 32 sections, 1 equation, 7 figures, 9 tables.

Figures (7)

  • Figure 1: DocVLM enhances VLMs' reading capabilities. Integrating DocVLM (solid lines) in top-performing VLMs (dashed lines) consistently improves the performance across all token budgets, frequently surpassing the baseline at higher token counts.
  • Figure 2: DocVLM Architecture. DocVLM enhances document understanding in frozen VLMs by integrating an OCR module with a query compression mechanism. By condensing OCR data into $\text{M}=64$ learnable tokens, DocVLM effectively complements visual information, surpassing the VLM’s inherent approaches of increasing image resolution or visual feature dimensions.
  • Figure 3: Qualitative Results. Representative examples of DocVLM's performance across diverse document formats, from dense text to infographics and scene text. Our model successfully handles complex layouts, dense content, and presents instruction-following capabilities without explicit training on such datasets. Each example includes an image-instruction pair with baseline and DocVLM predictions.
  • Figure 4: Balancing Performance and Compute. Analysis of model performance (lines, left y-axis) and token usage (bars, right y-axis) as a function of visual token allocation. Each model employs its inherent token control strategy: AnyRes max for feature downsampling (LLaVA-OneVision), dynamic max tiles (InternVL2), and max image tokens for resolution control (Qwen2-VL). The results highlight that DocVLM consistently improves performance with minimal overhead (64 tokens), offering an efficient OCR-visual token allocation.
  • Figure 5: Compression Levels. DocVQA validation results for DocVLM integrated with Qwen2-VL across varying OCR and image token budgets. "0" represents the baseline, while "Full" indicates uncompressed encodings.
  • ...and 2 more figures
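
The token accounting behind Figures 4 and 5 can be sketched as simple arithmetic. This is a hedged illustration only: the function names, the mode labels, and the page-level numbers (`IMAGE_TOKENS`, `RAW_OCR_TOKENS`) are hypothetical and chosen for the example, not measurements from the paper. Only the 64-query figure comes from the source. The multipage case assumes each page contributes its own image tokens and its own 64 queries, which is one possible encoding choice and is not confirmed by this section.

```python
def ocr_token_count(ocr_mode, raw_ocr_tokens, num_queries=64):
    """OCR tokens passed to the LLM under each setting.

    "none" corresponds to the "0" baseline in Figure 5, "full" to the
    uncompressed encoding, and "compressed" to the learned queries.
    """
    if ocr_mode == "none":
        return 0
    if ocr_mode == "full":
        return raw_ocr_tokens
    if ocr_mode == "compressed":
        return num_queries
    raise ValueError(f"unknown ocr_mode: {ocr_mode!r}")


def llm_input_tokens(image_tokens, ocr_mode, raw_ocr_tokens, num_pages=1):
    """Total image plus OCR tokens, assuming identical per-page costs."""
    per_page = image_tokens + ocr_token_count(ocr_mode, raw_ocr_tokens)
    return num_pages * per_page


# Hypothetical page: these two numbers are illustrative assumptions.
IMAGE_TOKENS = 256
RAW_OCR_TOKENS = 800

for mode in ("none", "compressed", "full"):
    total = llm_input_tokens(IMAGE_TOKENS, mode, RAW_OCR_TOKENS)
    print(f"{mode:>10}: {total} tokens")

ten_pages = llm_input_tokens(IMAGE_TOKENS, "compressed", RAW_OCR_TOKENS, num_pages=10)
print(f"10 pages, compressed: {ten_pages} tokens")
```

Under these assumed numbers, the compressed setting adds a fixed 64 tokens per page on top of the image tokens, while the uncompressed setting grows with the amount of text on the page.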