DocVLM: Make Your VLM an Efficient Reader

Mor Shpigel Nacson; Aviad Aberdam; Roy Ganz; Elad Ben Avraham; Alona Golts; Yair Kittenplon; Shai Mazor; Ron Litman

TL;DR

DocVLM makes document understanding in vision-language models more efficient by adding an OCR-based modality and compressing the OCR information into a compact set of 64 learned queries. The approach preserves the base VLM weights and feeds the compressed OCR representations into the LLM, so the model remains accurate under restricted token budgets without depending on high-resolution image inputs. Evaluations across LLaVA-OneVision, InternVL2, and Qwen2-VL show consistent improvements in token-constrained regimes, along with strong multipage capabilities: state-of-the-art performance on MP-DocVQA and strong zero-shot results on DUDE. The work also provides a detailed ablation study of the compression strategy, two-stage training, and multipage encoding choices, positioning DocVLM as a practical, model-agnostic approach to efficient document reading.

Abstract

Vision-Language Models (VLMs) excel in diverse visual tasks but face challenges in document understanding, which requires fine-grained text processing. While typical visual tasks perform well with low-resolution inputs, reading-intensive applications demand high-resolution inputs, resulting in significant computational overhead. Using OCR-extracted text in VLM prompts partially addresses this issue but underperforms compared to its full-resolution counterpart, as it lacks the complete visual context needed for optimal performance. We introduce DocVLM, a method that integrates an OCR-based modality into VLMs to enhance document processing while preserving the original weights. Our approach employs an OCR encoder to capture textual content and layout, compressing these into a compact set of learned queries incorporated into the VLM. Comprehensive evaluations across leading VLMs show that DocVLM significantly reduces reliance on high-resolution images for document understanding. In limited-token regimes (448$\times$448), DocVLM with 64 learned queries improves DocVQA results from 56.0% to 86.6% when integrated with InternVL2 and from 84.4% to 91.2% with Qwen2-VL. In LLaVA-OneVision, DocVLM achieves improved results while using 80% fewer image tokens. The reduced token usage allows processing multiple pages effectively, showing impressive zero-shot results on DUDE and state-of-the-art performance on MP-DocVQA, highlighting DocVLM's potential for applications requiring both high performance and efficiency.
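
The abstract does not spell out how the OCR encoder compresses its output into learned queries. As an illustrative sketch only, the snippet below assumes a cross-attention pooling step, which is one common way to map a variable-length sequence onto a fixed number of vectors; it is not claimed to be the mechanism used in DocVLM. The names `compress`, `attention_weights`, and `random_matrix`, the embedding size `DIM`, and the random stand-in embeddings are all hypothetical. Only the query count of 64 comes from the text above.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, tokens):
    """Scaled dot-product attention weights, one row per query."""
    scale = math.sqrt(len(queries[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, t)) / scale for t in tokens])
        for q in queries
    ]


def compress(queries, tokens):
    """Pool a variable-length token sequence into len(queries) vectors."""
    dim = len(tokens[0])
    weights = attention_weights(queries, tokens)
    return [
        [sum(w * t[j] for w, t in zip(row, tokens)) for j in range(dim)]
        for row in weights
    ]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for real embeddings (Gaussian noise)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
NUM_QUERIES = 64   # query count stated in the abstract
DIM = 16           # assumed embedding size, chosen only to keep the demo small

# In a trained model these would be learned parameters; here they are random.
queries = random_matrix(NUM_QUERIES, DIM, rng)

# Two pages with different amounts of OCR text (assumed token counts).
short_page = random_matrix(50, DIM, rng)
long_page = random_matrix(400, DIM, rng)

compressed_short = compress(queries, short_page)
compressed_long = compress(queries, long_page)

# Both pages end up as the same fixed-size block of 64 vectors.
print(len(compressed_short), len(compressed_long))
```

The point of the sketch is the fixed output size: however much text a page contains, the language model would receive the same number of OCR-derived vectors, which is what keeps the token budget predictable across single-page and multipage inputs.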
