Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models

Jonathan Steinberg, Oren Gal

TL;DR

Surprisingly, in models with modular OCR circuits (notably Qwen3-VL-4B), OCR removal can improve counting performance (up to +6.9 percentage points), suggesting OCR interferes with other visual processing in sufficiently modular architectures.

Abstract

Vision-language models (VLMs) can read text from images, but where does this optical character recognition (OCR) information enter the language processing stream? We investigate the OCR routing mechanism across three architecture families (Qwen3-VL, Phi-4, InternVL3.5) using causal interventions. By computing activation differences between original images and text-inpainted versions, we identify architecture-specific OCR bottlenecks whose dominant location depends on the vision-language integration strategy: DeepStack models (Qwen) show peak sensitivity at mid-depth (about 50%) for scene text, while single-stage projection models (Phi-4, InternVL) peak at early layers (6-25%), though the exact layer of maximum effect varies across datasets. The OCR signal is remarkably low-dimensional: PC1 captures 72.9% of variance. Crucially, principal component analysis (PCA) directions learned on one dataset transfer to others, demonstrating shared text-processing pathways. Surprisingly, in models with modular OCR circuits (notably Qwen3-VL-4B), OCR removal can improve counting performance (up to +6.9 percentage points), suggesting OCR interferes with other visual processing in sufficiently modular architectures.
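
The pipeline described in the abstract (activation deltas between original and text-inpainted images, PCA on those deltas, then removal of the leading direction) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the function names `ocr_direction` and `project_out`, the array shapes, the mean-centering step, and the synthetic "true" direction are all assumptions made for the example.

```python
import numpy as np

def ocr_direction(acts_original, acts_inpainted):
    """Top principal direction of the activation deltas and its variance share.

    Both inputs are (n_pairs, d_model) arrays taken at the same layer
    (hypothetical layout; the paper's exact token pooling is not specified here).
    """
    deltas = acts_original - acts_inpainted
    # Standard PCA centers the data first; whether the paper does so is an assumption.
    centered = deltas - deltas.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return vt[0], explained[0]

def project_out(hidden, direction):
    """Remove the component of each row of `hidden` along `direction`."""
    unit = direction / np.linalg.norm(direction)
    return hidden - np.outer(hidden @ unit, unit)

# Synthetic stand-in for real activations: one planted direction plus small noise.
rng = np.random.default_rng(0)
n_pairs, d_model = 200, 64
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)
acts_inpainted = rng.normal(size=(n_pairs, d_model))
strength = rng.normal(loc=3.0, scale=2.0, size=(n_pairs, 1))
noise = 0.1 * rng.normal(size=(n_pairs, d_model))
acts_original = acts_inpainted + strength * true_dir + noise

direction, pc1_ratio = ocr_direction(acts_original, acts_inpainted)
cleaned = project_out(acts_original, direction)
```

In a real model, `project_out` would be applied to the residual stream at the candidate bottleneck layer (for example through a forward hook), and task accuracy would be compared with and without the intervention.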

Paper Structure (47 sections, 2 equations, 5 figures, 11 tables)

This paper contains 47 sections, 2 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Causal intervention on the VLM residual stream. By identifying the OCR direction via PCA on activation deltas (left) and projecting it out at the bottleneck (center), OCR capability is effectively suppressed while spatial reasoning and counting tasks are functionally separated and preserved (right).
  • Figure 2: Example original-inpainted pairs from EgoTextVQA. Left: original images with naturally-occurring text. Right: inpainted versions with text regions removed. Activation differences between pairs capture the "OCR signal."
  • Figure 3: Small multiples view: unnormalized (absolute layer) comparison across all models. Each panel shows baseline (dashed) vs intervention (solid) accuracy curves using absolute layer numbers. Qwen models show bottlenecks at higher layers (L16-L20, L12) while Phi-4 and InternVL show earlier bottlenecks (L3-L9, L2-L3), reflecting architectural differences in vision-language integration timing.
  • Figure 4: Cross-dataset generalization of OCR suppression. Each panel shows one model's accuracy delta across layers for all three datasets. PCA directions learned on EgoTextVQA (green) transfer effectively to OCRBench (blue) and InfoVQA (red), demonstrating that learned OCR representations generalize beyond the training distribution.
  • Figure 5: OCR bottleneck location by architecture and dataset. Each panel shows mean accuracy delta across three network depth stages (Early 0-33%, Mid 33-66%, Late 66-100%) for one dataset. EgoTextVQA and OCRBench show consistent patterns: InternVL3.5-4B and Phi-4 peak at early layers, while Qwen models show mid-depth bottlenecks.
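
The depth-stage aggregation in Figure 5 (Early 0-33%, Mid 33-66%, Late 66-100%) amounts to mapping each absolute layer index to a relative depth and averaging the per-layer accuracy deltas within each stage. The sketch below is one plausible way to do that; the boundary handling (a layer exactly at one third of the depth counts as Mid), the sign convention (intervention minus baseline), and the example numbers are assumptions for illustration and are not taken from the paper.

```python
def depth_stage(layer, n_layers):
    """Map a 0-indexed layer to 'early', 'mid', or 'late' by relative depth."""
    # Integer comparisons avoid floating-point edge cases at the 1/3 and 2/3 cuts.
    if 3 * layer < n_layers:
        return "early"
    if 3 * layer < 2 * n_layers:
        return "mid"
    return "late"

def mean_delta_by_stage(deltas):
    """Average per-layer accuracy deltas (in percentage points) within each stage.

    `deltas[i]` is assumed to be intervention accuracy minus baseline accuracy
    when the intervention is applied at layer i.
    """
    buckets = {"early": [], "mid": [], "late": []}
    for layer, delta in enumerate(deltas):
        buckets[depth_stage(layer, len(deltas))].append(delta)
    return {stage: sum(v) / len(v) for stage, v in buckets.items() if v}

# Hypothetical deltas for a 6-layer toy model with a mid-depth bottleneck.
toy_deltas = [-1.0, -3.0, -10.0, -20.0, -2.0, 0.0]
stage_means = mean_delta_by_stage(toy_deltas)
bottleneck_stage = min(stage_means, key=stage_means.get)
```

Normalizing by depth in this way is what lets models with different layer counts be compared in the same panel, which the absolute-layer view of Figure 3 does not allow.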