Seeing to Generalize: How Visual Data Corrects Binding Shortcuts

Nicolas Buzeta, Felipe del Rio, Cristian Hinostroza, Denis Parra, Hans Lobel, Rodrigo Toro Icarte

TL;DR

This work describes how binding strategies vary across training regimes, visual encoders, and initializations, and shows that analogous shifts occur during pretrained LLM-to-VLM transitions, suggesting that cross-modal training can enhance reasoning and generalization even for tasks grounded in a single modality.

Abstract

Vision Language Models (VLMs) are designed to extend Large Language Models (LLMs) with visual capabilities, yet in this work we observe a surprising phenomenon: VLMs can outperform their underlying LLMs on purely text-only tasks, particularly in long-context information retrieval. To investigate this effect, we build a controlled synthetic retrieval task and find that a transformer trained only on text achieves perfect in-distribution accuracy but fails to generalize out of distribution, while subsequent training on an image-tokenized version of the same task nearly doubles text-only OOD performance. Mechanistic interpretability reveals that visual training changes the model's internal binding strategy: text-only training encourages positional shortcuts, whereas image-based training disrupts them through spatial translation invariance, forcing the model to adopt a more robust symbolic binding mechanism that persists even after text-only examples are reintroduced. We further characterize how binding strategies vary across training regimes, visual encoders, and initializations, and show that analogous shifts occur during pretrained LLM-to-VLM transitions. Our findings suggest that cross-modal training can enhance reasoning and generalization even for tasks grounded in a single modality.
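
To make the in-distribution versus out-of-distribution setup concrete, here is a minimal sketch of what a synthetic binding-retrieval task of this kind could look like. It is an illustration only, not the authors' dataset: the vocabulary, the prompt template, the chosen context lengths, and the names `make_example` and `make_split` are all assumptions introduced here. The context lists color-shape pairs, and the query asks for the color bound to one shape.

```python
import random

# Hypothetical vocabulary; the paper's actual tokens are not given in this section.
COLORS = ["red", "green", "blue", "yellow", "purple", "orange"]
SHAPES = ["circle", "square", "triangle", "star", "hexagon", "diamond"]


def make_example(num_items, rng):
    """Build one example: a context of (color, shape) pairs plus a query about one shape."""
    if not 1 <= num_items <= len(SHAPES):
        raise ValueError("num_items must be between 1 and len(SHAPES)")
    shapes = rng.sample(SHAPES, num_items)          # unique shapes keep the answer unambiguous
    colors = [rng.choice(COLORS) for _ in shapes]   # colors may repeat
    target = rng.randrange(num_items)               # position of the queried item
    context = " , ".join(f"{c} {s}" for c, s in zip(colors, shapes))
    prompt = f"{context} . what color is the {shapes[target]} ?"
    return {"prompt": prompt, "answer": colors[target], "target_position": target}


def make_split(lengths, n, seed):
    """Sample n examples whose context lengths are drawn from `lengths`."""
    rng = random.Random(seed)
    return [make_example(rng.choice(lengths), rng) for _ in range(n)]


# Assumed split: short contexts for training, longer unseen lengths for OOD evaluation.
train_set = make_split(lengths=[2, 3], n=100, seed=0)
ood_set = make_split(lengths=[5, 6], n=100, seed=1)
```

In a setup like this, a model can reach perfect accuracy on `train_set` by memorizing which positions tend to hold the answer, which is the kind of positional shortcut that would fail on the longer contexts in `ood_set`.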

Paper Structure (63 sections, 4 equations, 15 figures, 2 tables)

Figures (15)

  • Figure 1: Overview of the training pipeline, showing the progression from text-only training to vision–language training.
  • Figure 2: Binding accuracy comparison between text-only LLMs and their VLM counterparts across three model generations. VLMs consistently outperform LLMs as context length increases. The left panel shows performance on the simpler direct retrieval task, while the right panel shows the harder indirect retrieval task. Individual breakdown per model is provided in Appendix \ref{app:qwen_performance}.
  • Figure 3: Combined generalization performance across all models. Text-Only (Blue) suffers from sharp degradation on OOD lengths (37.2% average). Image-Text (Green) significantly improves generalization (69.5% average). Noise-Text (Orange) benefits from positional range expansion but remains insufficient (57.5% average). Noise-Image-Text (Red) achieves the strongest robustness (83.6% average). Individual plots are provided in Appendix \ref{app:scratch_plots}.
  • Figure 4: Binding Mechanism Shift. Interchange intervention results showing the dominant binding mechanism at each layer across all model variants. Text-only models ($\mathcal{M}_{\text{text-only}}$) rely on the positional mechanism (blue). Noise augmentation ($\mathcal{M}_{\text{noise-text}}$) introduces an increase in symbolic binding, but the model remains predominantly positional. Image-trained models ($\mathcal{M}_{\text{image-text}}$ and $\mathcal{M}_{\text{noise-image-text}}$) both transition to a symbolic mechanism (orange).
  • Figure 5: Circuits corresponding to the binding mechanisms. Early layers encode token-level information (color, shape, item identity). Colored arrows indicate attention-mediated information transfer between tokens. The key distinction: the Positional Circuit transfers only position indices (implicit binding), while Symbolic Circuits transfer semantic content (explicit binding).
  • ...and 10 more figures
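
The interchange interventions mentioned in the Figure 4 caption can be illustrated with a toy example. The sketch below is not the authors' procedure, which operates on the internal activations of trained transformers; it replaces the network with two hand-written stand-ins (`positional_model` and `symbolic_model`, both hypothetical) that expose a single intermediate variable. Patching that variable from a source input into a base input, and checking which answer comes out, shows how the two binding mechanisms in the Figure 5 caption can be told apart.

```python
def positional_model(items, query, patched=None):
    """Toy model whose intermediate variable is the position index of the queried shape."""
    pos = patched if patched is not None else [s for _, s in items].index(query)
    return items[pos][0], pos  # (predicted color, intermediate variable)


def symbolic_model(items, query, patched=None):
    """Toy model whose intermediate variable is the identity of the queried shape."""
    sym = patched if patched is not None else query
    return {s: c for c, s in items}[sym], sym


def classify(model, base, source):
    """Patch the source run's intermediate variable into the base run and label the result."""
    base_items, _ = base
    source_items, source_query = source
    _, hidden = model(source_items, source_query)        # run on the source input
    patched_answer, _ = model(*base, patched=hidden)     # rerun the base input, patched

    # Answers predicted by each hypothesis about what the intermediate variable encodes.
    source_pos = [s for _, s in source_items].index(source_query)
    if_positional = base_items[source_pos][0]
    if_symbolic = {s: c for c, s in base_items}[source_query]

    if if_positional == if_symbolic:
        return "ambiguous"  # this base/source pair cannot separate the hypotheses
    if patched_answer == if_positional:
        return "positional"
    if patched_answer == if_symbolic:
        return "symbolic"
    return "neither"


# Items are (color, shape) pairs; the source reorders them so position and identity disagree.
base = ([("red", "circle"), ("blue", "square"), ("green", "star")], "circle")
source = ([("yellow", "star"), ("red", "circle"), ("blue", "square")], "square")

print(classify(positional_model, base, source))  # positional
print(classify(symbolic_model, base, source))    # symbolic
```

The base and source inputs are chosen so that the two hypotheses predict different colors after patching ("green" for position 2 of the base, "blue" for the square in the base); pairs where the predictions coincide carry no evidence and are reported as ambiguous.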