Towards Understanding Multimodal Fine-Tuning: Spatial Features

Lachin Naghashyar, Hunar Batra, Ashkan Khakzar, Philip Torr, Ronald Clark, Christian Schroeder de Witt, Constantin Venhoff

TL;DR

Vision-language models gain performance through multimodal fine-tuning, but how language backbones adapt to visual grounding remains unclear. The authors extend stage-wise model diffing to multimodal settings, using sparse autoencoders to track feature-level shifts in a LLaMA-based backbone trained on VQAv2, revealing how vision grounding reshapes language representations. They identify vision-preferring features that rotate during training, isolate a compact subset that encodes spatial relations, and causally attribute them to a small group of mid-layer attention heads. The work provides a mechanistic, interpretable framework for auditing multimodal training and guiding targeted fine-tuning to refine vision-grounded capabilities.

Abstract

Contemporary Vision-Language Models (VLMs) achieve strong performance on a wide range of tasks by pairing a vision encoder with a pre-trained language model, fine-tuned for visual-text inputs. Yet despite these gains, it remains unclear how language backbone representations adapt during multimodal training and when vision-specific capabilities emerge. In this work, we present the first mechanistic analysis of VLM adaptation. Using stage-wise model diffing, a technique that isolates representational changes introduced during multimodal fine-tuning, we reveal how a language model learns to "see". We first identify vision-preferring features that emerge or reorient during fine-tuning. We then show that a selective subset of these features reliably encodes spatial relations, revealed through controlled shifts to spatial prompts. Finally, we trace the causal activation of these features to a small group of attention heads. Our findings show that stage-wise model diffing reveals when and where spatially grounded multimodal features arise. It also provides a clearer view of modality fusion by showing how visual grounding reshapes features that were previously text-only. This methodology enhances the interpretability of multimodal training and provides a foundation for understanding and refining how pretrained language models acquire vision-grounded capabilities.

Paper Structure

This paper contains 43 sections, 8 equations, 17 figures, and 2 tables.

Figures (17)

  • Figure 1: SAE adaptation on LLaVA-MORE. Top: Mean fraction of variance unexplained (FVU) across layers on the validation set. Bottom: Summary statistics of FVU values on the validation set, with decimal alignment; the lowest mean is highlighted in bold.
  • Figure 2: Distribution of SAE features by visual energy and cosine similarity. All features are shown in gray; adapted features are highlighted in pink. Spatial candidates are marked with blue squares, and the subset used for downstream analysis is shown as red crosses.
  • Figure 3: Auto-Interp example (Layer 16, Feature 176). Top VQA and VSR samples both highlight facing direction, with activation on objects described as facing toward, away, or relative to another.
  • Figure 4: Attribution patching across related spatial features. Top: recurring top-scoring head (L13H1) localizes to relevant regions in queries about “on top of” relations. Middle: bottom-ranked heads on the same samples fail to capture spatial structure. Bottom: unrelated queries confirm that the top head does not spuriously activate.
  • Figure 5: Decoder cosine similarity vs. layer (LLM SAE vs. VLM SAE). Text-only stays highly aligned across layers; image-only and full-sequence rotate in shallow layers and align later; random remains near zero. Higher cosine indicates closer alignment of SAE decoder directions.
  • ...and 12 more figures
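
Two quantities recur in the captions above: the fraction of variance unexplained (FVU) in Figure 1 and the cosine similarity between SAE decoder directions in Figure 5. The sketch below shows one common way to compute each. It is illustrative only: the function names are hypothetical, FVU is taken in its usual form (residual sum of squares divided by total sum of squares around the mean activation), and decoder directions in the two SAEs are assumed to be matched by feature index. The paper may define or normalize these differently.

```python
import math


def fvu(activations, reconstructions):
    """Fraction of variance unexplained: residual SS / total SS.

    `activations` and `reconstructions` are equal-shaped lists of vectors.
    0.0 means perfect reconstruction; 1.0 matches predicting the mean.
    """
    n = len(activations)
    dim = len(activations[0])
    # Per-dimension mean of the original activations.
    mean = [sum(row[j] for row in activations) / n for j in range(dim)]
    residual = sum(
        (x - r) ** 2
        for row, rec in zip(activations, reconstructions)
        for x, r in zip(row, rec)
    )
    total = sum((x - m) ** 2 for row in activations for x, m in zip(row, mean))
    return residual / total


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def decoder_cosines(decoder_a, decoder_b):
    """Per-feature cosine between two decoders, one direction per row.

    Assumption: row i of `decoder_a` and row i of `decoder_b` describe
    the same feature (for example, before and after fine-tuning).
    """
    return [cosine(u, v) for u, v in zip(decoder_a, decoder_b)]
```

Under these assumptions, a feature whose decoder cosine stays near 1 kept its direction through fine-tuning, while a low value indicates that the direction rotated.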