Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs

Kaiser Sun, Xiaochuang Yuan, Hongjun Liu, Chen Zhao, Cheng Zhang, Mark Dredze, Fan Bai

TL;DR

The authors propose a self-distillation method that trains the model on its own pure text reasoning traces paired with image inputs. It raises image-mode accuracy on GSM8K from 30.71% to 92.72% and suggests a practical path toward improving visual text understanding in multimodal language models.

Abstract

Multimodal large language models (MLLMs) can process text presented as images, yet they often perform worse than when the same content is provided as textual tokens. We systematically diagnose this "modality gap" by evaluating seven MLLMs across seven benchmarks in five input modes, spanning both synthetically rendered text and realistic document images from arXiv PDFs to Wikipedia pages. We find that the modality gap is task- and data-dependent. For example, math tasks degrade by over 60 points on synthetic renderings, while natural document images often match or exceed text-mode performance. Rendering choices such as font and resolution are strong confounds, with font alone swinging accuracy by up to 47 percentage points. To understand this, we conduct a grounded-theory error analysis of over 4,000 examples, revealing that image mode selectively amplifies reading errors (calculation and formatting failures) while leaving knowledge and reasoning errors largely unchanged, and that some models exhibit a chain-of-thought reasoning collapse under visual input. Motivated by these findings, we propose a self-distillation method that trains the model on its own pure text reasoning traces paired with image inputs, raising image-mode accuracy on GSM8K from 30.71% to 92.72% and transferring to unseen benchmarks without catastrophic forgetting. Overall, our study provides a systematic understanding of the modality gap and suggests a practical path toward improving visual text understanding in multimodal language models.
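
The data-construction step behind the self-distillation idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_self_distillation_set`, `DistillExample`, `generate_from_text`, and `render_to_image` are hypothetical names, and the two callables are stand-ins for a real model call and a real text renderer. The abstract does not say whether traces are filtered or how training is run, so neither is shown here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class DistillExample:
    image: bytes   # the question rendered as an image (image-mode input)
    target: str    # the model's own reasoning trace, produced in text mode


def build_self_distillation_set(
    questions: Iterable[str],
    generate_from_text: Callable[[str], str],
    render_to_image: Callable[[str], bytes],
) -> List[DistillExample]:
    """Pair each rendered question with the trace the model wrote in text mode."""
    examples = []
    for question in questions:
        trace = generate_from_text(question)   # text-mode output of the same model
        image = render_to_image(question)      # same content, now as pixels
        examples.append(DistillExample(image=image, target=trace))
    return examples


# Hypothetical stand-ins so the sketch runs without a model or a renderer.
def fake_generate(question: str) -> str:
    return f"Step-by-step reasoning for: {question}"


def fake_render(question: str) -> bytes:
    return question.encode("utf-8")


pairs = build_self_distillation_set(["What is 2 + 3?"], fake_generate, fake_render)
```

Each resulting pair would then serve as one supervised example: the image is the input, and the text-mode trace is the target the model is trained to reproduce.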

Paper Structure (39 sections, 14 figures, 6 tables)

Figures (14)

  • Figure 1: Humans read text through vision from diverse sources such as books, webpages, and documents, whereas MLLMs may yield different predictions when the same textual content is presented in different input forms.
  • Figure 2: Diagnosing the modality gap in visual text understanding. We evaluate MLLMs across five input modes: pure text, rendered text images, real-world visual text, and two OCR-based diagnostic settings (OCR-1P and OCR-2P). Our error taxonomy reveals that the image modality amplifies reading and calculation errors and reduces the likelihood of triggering a reasoning chain, while leaving underlying reasoning capabilities largely intact. We further propose remedies to bridge the performance gap, including rendering specification control, resolution-aware preprocessing, and LM-only self-distillation.
  • Figure 3: Performance difference between Pure Text and Pure Image with different renderings. Each point is a dataset-model pair. Handwriting consistently causes larger negative drops than all the other settings.
  • Figure 4: Performance as a function of image resolution scale on HumanEval (top) and ARC (bottom). The black vertical dashed line indicates the point where Pure Image mode consumes the same FLOPs as Pure Text mode. Most models maintain stable performance until the resolution falls below a certain lower bound, while InternVL-3.5-8B maintains stable performance across all resolutions.
  • Figure 5: Distribution of error categories across input modes (left) and datasets (right). Cell values show raw counts and within-column percentages.
  • ...and 9 more figures