ReadBench: Measuring the Dense Text Visual Reading Ability of Vision-Language Models

Benjamin Clavié, Florian Brand

TL;DR

ReadBench addresses the gap in evaluating vision-language models' ability to read and reason about visually presented text, especially in long documents. By converting contexts from text-only benchmarks into images while preserving prompts, it creates a realistic multimodal reading scenario. Across nine VLMs, the study finds universal degradation in multimodal reading, with longer contexts driving larger performance drops and resolution having negligible impact. The work reveals model-specific weaknesses, provides a scalable evaluation protocol, and offers resources for future benchmarking and improvements in text-rich visual understanding.

Abstract

Recent advancements in Large Vision-Language Models (VLMs) have greatly enhanced their capability to jointly process text and images. However, despite extensive benchmarks evaluating visual comprehension (e.g., diagrams, color schemes, OCR tasks), there is limited assessment of VLMs' ability to read and reason about text-rich images effectively. To fill this gap, we introduce ReadBench, a multimodal benchmark specifically designed to evaluate the reading comprehension capabilities of VLMs. ReadBench transposes contexts from established text-only benchmarks into images of text while keeping textual prompts and questions intact. Evaluating leading VLMs with ReadBench, we find minimal-but-present performance degradation on short text-image inputs, while performance declines sharply for longer, multi-page contexts. Our experiments further reveal that text resolution has negligible effects on multimodal performance. These findings highlight needed improvements in VLMs, particularly in their reasoning over extensive visually presented textual content, a capability critical for practical applications. ReadBench is available at https://github.com/answerdotai/ReadBench .
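
The conversion described in the abstract (context becomes page images, prompt and question stay as text) can be illustrated with a minimal sketch. This is not the ReadBench implementation: the function names, page size, characters per line, lines per page, and default PPI are all hypothetical, and the sketch only plans page layouts rather than drawing pixels, which would need an imaging library.

```python
import textwrap
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    lines: List[str]  # wrapped lines of context text placed on this page
    width_px: int     # page width in pixels at the chosen PPI
    height_px: int    # page height in pixels at the chosen PPI


def paginate_context(context, ppi=92, page_inches=(8.5, 11.0),
                     chars_per_line=90, lines_per_page=50):
    """Split a text context into page layouts (all defaults are assumptions)."""
    width_px = round(page_inches[0] * ppi)
    height_px = round(page_inches[1] * ppi)
    lines = []
    # Wrap each paragraph on its own so paragraph breaks are preserved.
    for paragraph in context.splitlines():
        lines.extend(textwrap.wrap(paragraph, width=chars_per_line) or [""])
    # Long contexts spill over onto several pages.
    return [Page(lines[i:i + lines_per_page], width_px, height_px)
            for i in range(0, len(lines), lines_per_page)]


def build_request(context, question, **layout):
    """Context is sent as page images; the question stays plain text."""
    return {"images": paginate_context(context, **layout), "text": question}


def score_drop(text_score, multimodal_score):
    """Illustrative degradation in points (not necessarily the paper's metric)."""
    return text_score - multimodal_score
```

With these assumed defaults, a short context fits on one page while a long one yields several, which mirrors the short versus multi-page distinction drawn in the abstract. Changing `ppi` alters only the pixel dimensions of each page, not how the text is split.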

Paper Structure

This paper contains 23 sections and 4 figures.

Figures (4)

  • Figure 1: An example MMLU-Redux converted input
  • Figure 2: Gemini 2.0 Flash multi-modal scores across a range of PPI settings (resulting in different resolutions).
  • Figure 3: Performance degradation overview across datasets for all models
  • Figure 4: Consistency of multimodal–text disagreements across models.