Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs

Yaniv Nikankin, Dana Arad, Yossi Gandelsman, Yonatan Belinkov

TL;DR

This work investigates why Vision-Language Models (VLMs) underperform on visual analogs of textual tasks by comparing the circuits each modality uses. It defines a circuit as a minimal task-specific subgraph of model components and uses causal attribution patching to discover and evaluate circuits. The resulting circuits are largely disjoint across vision and language, especially at data positions, yet their query and generation components are largely functionally equivalent. The authors show that differences in data processing drive the accuracy gap, while query and answer-generation processing is functionally interchangeable across modalities. This motivates a test-time intervention called back-patching, which injects deeper, text-aligned visual representations into earlier layers. Across three VLMs and five tasks, back-patching raises accuracy by about 4.6 percentage points on average and closes roughly 32% of the visual-textual performance gap, suggesting a training-free path to improved multi-modal performance. The findings highlight the value of understanding modality-specific circuits and point to targeted inference-time modifications as a practical way to reduce cross-modal gaps in VLMs.
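
To make the "closes roughly 32% of the gap" figure concrete, the sketch below shows one common way such a fraction could be computed. The formula and the function name `gap_closed_fraction` are assumptions for illustration (the summary does not state the exact definition), and the accuracies in the usage note are hypothetical, not results from the paper.

```python
def gap_closed_fraction(text_acc: float, visual_acc: float, patched_visual_acc: float) -> float:
    """Fraction of the text-vs-visual accuracy gap recovered by an intervention.

    Assumed definition (not taken from the paper):
        (patched_visual_acc - visual_acc) / (text_acc - visual_acc)
    """
    gap = text_acc - visual_acc
    if gap <= 0:
        # No gap to close: the ratio is undefined or meaningless.
        raise ValueError("text accuracy must exceed visual accuracy")
    return (patched_visual_acc - visual_acc) / gap
```

For example, with hypothetical accuracies of 0.90 on text, 0.70 on images, and 0.75 on images after the intervention, the function returns about 0.25, meaning a quarter of the gap is closed. Because the reported numbers are averages over models and tasks, the average boost divided by the average fraction need not equal the average gap.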

Abstract

Vision-Language models (VLMs) show impressive abilities to answer questions on visual inputs (e.g., counting objects in an image), yet demonstrate higher accuracies when performing an analogous task on text (e.g., counting words in a text). We investigate this accuracy gap by identifying and comparing the circuits (the task-specific computational sub-graphs) in different modalities. We show that while circuits are largely disjoint between modalities, they implement relatively similar functionalities: the differences lie primarily in processing modality-specific data positions (an image or a text sequence). Zooming in on the image data representations, we observe they become aligned with the higher-performing analogous textual representations only towards later layers, too late in processing to effectively influence subsequent positions. To overcome this, we patch the representations of visual data tokens from later layers back into earlier layers. In experiments with multiple tasks and models, this simple intervention closes a third of the performance gap between the modalities, on average. Our analysis sheds light on the multi-modal performance gap in VLMs and suggests a training-free approach for reducing it.
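
The back-patching step described above can be sketched on a toy model. This is a minimal illustration under stated assumptions, not the authors' implementation: the scalar "activations", the toy causal layer, and the names `make_toy_layer`, `forward`, and `back_patch` are all hypothetical. It shows only the control flow: run once, capture the data-position states from a later layer, then rerun with those states written into an earlier layer.

```python
from typing import Callable, List, Optional, Sequence

Hidden = List[float]  # toy stand-in: one scalar activation per token position
Layer = Callable[[Hidden], Hidden]


def make_toy_layer(weight: float) -> Layer:
    """Toy causal layer: each position adds a weighted mean of itself and earlier positions."""
    def layer(h: Hidden) -> Hidden:
        return [h[i] + weight * sum(h[: i + 1]) / (i + 1) for i in range(len(h))]
    return layer


def forward(
    layers: Sequence[Layer],
    inputs: Hidden,
    patch_layer: Optional[int] = None,
    patch_positions: Sequence[int] = (),
    patch_values: Optional[Hidden] = None,
) -> List[Hidden]:
    """Return the state entering each layer, followed by the final output.

    If patch_layer is set, the chosen positions of the state entering that
    layer are overwritten with patch_values before the layer runs.
    """
    h = list(inputs)
    states: List[Hidden] = []
    for idx, layer in enumerate(layers):
        if patch_layer == idx and patch_values is not None:
            for pos in patch_positions:
                h[pos] = patch_values[pos]
        states.append(list(h))
        h = layer(h)
    states.append(list(h))
    return states


def back_patch(layers, inputs, data_positions, src_layer, dst_layer):
    """Copy data-position states from a later layer (src) into an earlier one (dst)."""
    if not 0 <= dst_layer < src_layer <= len(layers):
        raise ValueError("need 0 <= dst_layer < src_layer <= number of layers")
    clean = forward(layers, inputs)          # first pass: record all states
    late_states = clean[src_layer]           # deeper representations to re-inject
    patched = forward(layers, inputs, dst_layer, data_positions, late_states)
    return clean[-1], patched[-1]            # outputs without and with back-patching
```

In a real VLM the same pattern would act on hidden-state tensors at image-token positions, and the source and destination layers would need to be chosen per model and task; those choices are outside the scope of this sketch.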

Paper Structure

This paper contains 37 sections, 13 equations, 14 figures, 10 tables.

Figures (14)

  • Figure 1: Overview of our analysis. (a) We find circuits for analogous vision and language tasks and show they are structurally disjoint: different model components are responsible for each modality. (b) Swapping sub-circuits across modalities (shown for the language circuit, but applies similarly to vision) reveals that query and generation components preserve performance when swapped between modalities, while swapping data components degrades performance. (c) To address the performance gap, we apply back-patching: re-injecting visual data activations from later layers into earlier ones. This makes textually-aligned representations from deeper layers available during visual prompt processing, enhancing performance on visual tasks.
  • Figure 2: Analogous Vision-Language Tasks. We create a dataset of five question-answering tasks, each with a textual and a visual variant. A task prompt is made up of a query (bottom row) asked either about an image (middle row) for the visual variant or about an analogous text (top row) for the textual variant.
  • Figure 3: Patching effects for Qwen2-7B-VL for the textual (left) and visual (right) counting task. We sum the patching effect (described in the circuit discovery section) across all model components for a specific position and layer. This reveals different patterns of component importance by position, motivating the separation of each circuit into three sub-circuits: data, query, and generation.
  • Figure 4: Circuit faithfulness across models and tasks. We measure the faithfulness at each circuit size, for each model, task and modality. The circuits we further analyze are the minimal circuits that achieve faithfulness of over 80%.
  • Figure 5: Normalized IoU scores. We measure the intersection over union (IoU) between the component sets of the textual and visual circuits for each model and task and normalize it using a random baseline. We find that across models and tasks, the intersection of components in data token positions (blue) is close to zero and the intersection of components in query token positions (orange) is very low. The intersection of components in the last token position (green) varies between tasks.
  • ...and 9 more figures
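
The normalized IoU in Figure 5 can be illustrated with a small sketch. The caption says only that the IoU is normalized "using a random baseline"; the specific normalization used here, (IoU - baseline) / (1 - baseline) with the baseline estimated by sampling random component sets of the same sizes, is an assumption, as are the function names and the integer component identifiers.

```python
import random
from typing import Hashable, Sequence, Set


def iou(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Intersection over union of two component sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def random_baseline_iou(
    universe: Sequence[Hashable], size_a: int, size_b: int, trials: int = 2000, seed: int = 0
) -> float:
    """Monte Carlo estimate of the IoU expected between two random sets of the given sizes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += iou(set(rng.sample(universe, size_a)), set(rng.sample(universe, size_b)))
    return total / trials


def normalized_iou(
    a: Set[Hashable], b: Set[Hashable], universe: Sequence[Hashable], trials: int = 2000, seed: int = 0
) -> float:
    """Assumed normalization: 0 means chance-level overlap, 1 means identical sets."""
    baseline = random_baseline_iou(universe, len(a), len(b), trials, seed)
    if baseline >= 1.0:
        return 0.0  # degenerate case: random sets already overlap completely
    return (iou(a, b) - baseline) / (1.0 - baseline)
```

Under this assumed normalization, two circuits that overlap no more than random sets of the same size score near zero, which is how a "close to zero" value at data token positions would be read.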