Too Late to Recall: Explaining the Two-Hop Problem in Multimodal Knowledge Retrieval

Constantin Venhoff, Ashkan Khakzar, Sonia Joseph, Philip Torr, Neel Nanda

TL;DR

The paper identifies a two-hop bottleneck in multimodal knowledge retrieval: a VLM must first form a robust entity representation from the image before it can tap into the factual recall of its LLM backbone. In a benchmark of 14 VLMs, the authors find factual-recall degradation in 11 models, especially in adapter-based and cross-attention architectures. Using attribution patching, activation patching, and probing, they show that degraded VLMs resolve the entity too late and therefore bypass the early-layer recall circuits, while well-aligned models engage these circuits earlier. They demonstrate partial recovery in two ways: patching the LLM backbone's entity representations into the VLM, and chain-of-thought prompting. The results point to reasoning-based mitigation as a practical option and to early visual-to-text integration as critical for robust multimodal reasoning.

Abstract

Training vision language models (VLMs) aims to align visual representations from a vision encoder with the textual representations of a pretrained large language model (LLM). However, many VLMs exhibit reduced factual recall performance compared to their LLM backbones, raising the question of how effective multimodal fine-tuning is at extending existing mechanisms within the LLM to visual inputs. We argue that factual recall based on visual inputs requires VLMs to solve a two-hop problem: (1) forming entity representations from visual inputs, and (2) recalling associated factual knowledge based on these entity representations. By benchmarking 14 VLMs with various architectures (LLaVA, Native, Cross-Attention), sizes (7B-124B parameters), and training setups on factual recall tasks against their original LLM backbone models, we find that 11 of 14 models exhibit factual recall degradation. We select three models with high and two models with low performance degradation, and use attribution patching, activation patching, and probing to show that degraded VLMs struggle to use the existing factual recall circuit of their LLM backbone, because they resolve the first hop too late in the computation. In contrast, high-performing VLMs resolve entity representations early enough to reuse the existing factual recall mechanism. Finally, we demonstrate two methods to recover performance: patching entity representations from the LLM backbone into the VLM, and prompting with chain-of-thought reasoning. Our results highlight that the speed of early entity resolution critically determines how effective VLMs are in using preexisting LLM mechanisms. More broadly, our work illustrates how mechanistic analysis can explain and unveil systematic failures in multimodal alignment.
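
To make the two patching methods named in the abstract concrete, the following is a minimal sketch on a toy scalar "model", not the authors' setup. Activation patching replaces one component's activation in a corrupted run with its value from a clean run and measures the exact change in the output metric. Attribution patching approximates that change to first order as (clean activation minus corrupted activation) times the gradient of the metric at the corrupted run. The function names, `WEIGHTS`, and all activation values are hypothetical, and the direction of patching (clean into corrupted) is an assumption.

```python
import math

# Toy "model": three component activations feed a scalar metric.
# WEIGHTS and the activation values below are made-up numbers for illustration.
WEIGHTS = [2.0, -1.0, 0.5]


def metric(acts):
    """Scalar output (think: a logit difference) as a function of the activations."""
    return sum(w * math.tanh(a) for w, a in zip(WEIGHTS, acts))


def metric_grad(acts):
    """Analytic gradient of the metric with respect to each activation."""
    return [w * (1.0 - math.tanh(a) ** 2) for w, a in zip(WEIGHTS, acts)]


def activation_patching(clean, corrupt):
    """Exact effect of swapping one corrupted activation for its clean value."""
    base = metric(corrupt)
    effects = []
    for i in range(len(corrupt)):
        patched = list(corrupt)
        patched[i] = clean[i]
        effects.append(metric(patched) - base)
    return effects


def attribution_patching(clean, corrupt):
    """First-order estimate: (clean - corrupt) * gradient at the corrupted run."""
    grad = metric_grad(corrupt)
    return [(c - k) * g for c, k, g in zip(clean, corrupt, grad)]


clean_acts = [0.30, 0.10, -0.20]
corrupt_acts = [0.10, 0.00, -0.10]

exact = activation_patching(clean_acts, corrupt_acts)
approx = attribution_patching(clean_acts, corrupt_acts)
for i, (e, a) in enumerate(zip(exact, approx)):
    print(f"component {i}: exact={e:+.4f}  first-order={a:+.4f}")
```

In this toy case both methods rank component 0 as the most influential. In a real network the linear estimate needs only one forward and one backward pass for all components at once, which is why it is typically used to screen components before running exact activation patching on the most promising ones.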

Paper Structure

This paper contains 37 sections, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Factual Recall in VLMs vs. LLMs – Illustration of the Two-Hop Problem. This figure compares how a Vision Language Model (VLM, left) and a text-only Language Model (LLM, right) perform a factual recall task. The VLM receives an image of the Perito Moreno Glacier and a question: “In which country is the entity located?” The image is processed by a Vision Transformer (ViT), producing distributed visual embeddings that do not align with the LLM’s pretrained token space. The entity (“Perito Moreno Glacier”) is only recognized in the middle layers, bypassing the early-layer MLPs responsible for factual recall and resulting in an incorrect answer. In contrast, the LLM is given the full question, with the subject tokens “Perito Moreno Glacier” available. This enables early-layer MLPs to access factual knowledge (“Argentina”) and produce the correct answer. The comparison highlights the core issue: VLMs must first infer the subject before retrieving facts, but because recognition occurs late, it cannot engage early factual recall mechanisms. This “two-hop” problem leads to degraded factual accuracy in VLMs, even when visual recognition succeeds.
  • Figure 2: Attribution scores of each MLP and Attention sublayer for the original LLM backbone models. Higher values indicate higher causal relevance for factual recall.
  • Figure 3: Attribution scores of each MLP and Attention sublayer for the VLM models. Higher values indicate higher causal relevance for factual recall.
  • Figure 4: Factual recall performance recovery for LLaVA-1.5-7B, LLaVA-1.5-13B, and LLaVA-MORE models when MLP outputs across entity tokens from the corresponding LLM backbones are patched into early VLM layers. We test only on examples where the VLM was originally wrong, so the y-axis directly shows the recovered performance gap. We compare our heuristic patching approach against a random baseline (randomly selected patching positions) and against back-patching.
  • Figure 5: Accuracy of linear probes trained on residual-stream representations at each transformer layer of LLaVA-1.5-7B, LLaVA-1.5-13B, LLaVA-MORE, Gemma-3-12B, and Qwen2.5-VL-7B, measured on ImageNet-100 entity prediction. The three LLaVA-style models exhibit a consistent pattern: probe accuracy remains poor in early layers and rises sharply in the middle-to-late layers. Gemma-3-12B and Qwen2.5-VL-7B, on the other hand, show consistently high probe accuracy.
  • ...and 6 more figures
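
The layer-wise probing described in the Figure 5 caption can be sketched as follows. This is a hedged illustration on synthetic data, not the paper's experiment: it uses a binary label instead of ImageNet-100 classes, a simple difference-of-means linear probe instead of a trained classifier, and a hypothetical `SIGNAL_PER_LAYER` schedule that injects class information only in later "layers" to imitate late entity resolution. All names and values are assumptions.

```python
import random


def make_layer_data(rng, n, dim, signal):
    """Synthetic 'residual stream' vectors: unit Gaussian noise plus a class shift."""
    xs, ys = [], []
    for i in range(n):
        y = i % 2  # balanced labels
        sign = 1.0 if y == 1 else -1.0
        xs.append([sign * signal + rng.gauss(0.0, 1.0) for _ in range(dim)])
        ys.append(y)
    return xs, ys


def mean(vectors, dim):
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]


def fit_probe(xs, ys, dim):
    """Difference-of-means linear probe: score(x) = w.x + b, predict 1 if score > 0."""
    mu1 = mean([x for x, y in zip(xs, ys) if y == 1], dim)
    mu0 = mean([x for x, y in zip(xs, ys) if y == 0], dim)
    w = [a - b for a, b in zip(mu1, mu0)]
    midpoint = [(a + b) / 2.0 for a, b in zip(mu1, mu0)]
    bias = -sum(wj * mj for wj, mj in zip(w, midpoint))
    return w, bias


def accuracy(w, bias, xs, ys):
    correct = 0
    for x, y in zip(xs, ys):
        score = sum(wj * xj for wj, xj in zip(w, x)) + bias
        correct += int((score > 0) == (y == 1))
    return correct / len(xs)


rng = random.Random(0)
DIM = 8
# Hypothetical schedule: no entity signal early, strong signal late.
SIGNAL_PER_LAYER = [0.0, 0.0, 0.1, 0.5, 2.0, 2.0]

probe_accuracy = []
for layer, signal in enumerate(SIGNAL_PER_LAYER):
    train_x, train_y = make_layer_data(rng, 400, DIM, signal)
    test_x, test_y = make_layer_data(rng, 200, DIM, signal)
    w, bias = fit_probe(train_x, train_y, DIM)
    acc = accuracy(w, bias, test_x, test_y)  # held-out accuracy for this layer
    probe_accuracy.append(acc)
    print(f"layer {layer}: probe accuracy = {acc:.2f}")
```

Fitting one probe per layer and evaluating each on held-out data gives an accuracy-versus-depth curve. The layer at which that curve rises is read as the point where the entity becomes linearly decodable from the representation.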