Investigating Mechanisms for In-Context Vision Language Binding

Darshana Saravanan, Makarand Tapaswi, Vineet Gandhi

TL;DR

The paper investigates how Vision-Language models (VLMs) bind visual objects to textual references by introducing the Shapes task, a controlled setting that requires mapping image patches (O0, O1) with attributes (C0, C1, I0, I1) to textual descriptions. It formalizes a Binding ID hypothesis that decomposes activations into content and binding components, and tests it via causal interventions and position-invariance checks. Results show distinct binding vectors associated with object, color, and item tokens that link image regions to text across modalities. Exchanging or perturbing these bindings alters or swaps predictions, while color bindings often remain invariant, which is consistent with a separable binding subspace. These findings advance the interpretability of VLMs and suggest manipulable, cross-modal binding mechanisms that could affect robust reasoning and safety-sensitive deployments.

Abstract

To understand a prompt, Vision-Language models (VLMs) must perceive the image, comprehend the text, and build associations within and across both modalities. For instance, given an 'image of a red toy car', the model should associate this image with phrases like 'car', 'red toy', 'red object', etc. Feng and Steinhardt propose the Binding ID mechanism in LLMs, suggesting that the entity and its corresponding attribute tokens share a Binding ID in the model activations. We investigate this for image-text binding in VLMs using a synthetic dataset and task that requires models to associate 3D objects in an image with their descriptions in the text. Our experiments demonstrate that VLMs assign a distinct Binding ID to an object's image tokens and its textual references, enabling in-context association.

Paper Structure

This paper contains 15 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Shapes Task. Given an image with two 3D objects and a text description (context), the model needs to comprehend the question and identify the correct item (token_p) contained in the queried object. Image and text tokens highlighted with the same color are expected to contain the same binding IDs, allowing the model to predict the correct answer.
  • Figure 2: Causal intervention. In steps 1 and 2, activations from the first and second samples are saved. In step 3, object/color/item activations in the first sample are replaced with those from the second. This new activation stack is frozen, and the model is queried with all four objects to observe the change in predictions.
  • Figure 3: Factorizability results. Each row shows the model's mean log probabilities of an item contained in an object. The first grid in each case shows results with unaltered activations. Squares highlighted in red denote the expected predictions based on our hypothesis. The model outputs match the hypothesis, suggesting a multimodal binding ID mechanism.
  • Figure 4: Mean intervention samples.
  • Figure 5: Position independence results. The integers on the x-axis show by how much the positions of the first and second objects/items/colors are incremented and decremented, respectively. The green line corresponds to no change in positions and the gray line corresponds to swapped positions. In all cases $O_k \leftrightarrow I_k$ (blue solid $O_0, I_0$ and orange dashed $O_1, I_1$) have a higher probability than $O_k \leftrightarrow I_k'$.
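
The idea behind the causal intervention (Figure 2) and the content/binding decomposition can be illustrated with a toy NumPy sketch. This is not the paper's code or model: the 8-dimensional activations, the split into a "content" half and a "binding" half, and the helper names `make_activation`, `predict_item`, and `swap_binding` are all hypothetical, chosen only to show why exchanging binding components would be expected to swap predictions.

```python
import numpy as np

DIM = 8  # toy activation width (assumption): dims 0..3 = content, dims 4..7 = binding
BINDING = slice(DIM // 2, DIM)


def make_activation(content_id: int, binding_id: int) -> np.ndarray:
    """Build a toy activation as content component + binding component."""
    z = np.zeros(DIM)
    z[content_id] = 1.0              # one-hot content (what the token is)
    z[DIM // 2 + binding_id] = 1.0   # one-hot binding ID (which group it belongs to)
    return z


def predict_item(obj_act: np.ndarray, item_acts: list) -> int:
    """Toy readout: pick the item whose binding component best matches the object's."""
    scores = [float(obj_act[BINDING] @ item[BINDING]) for item in item_acts]
    return int(np.argmax(scores))


def swap_binding(act_a: np.ndarray, act_b: np.ndarray):
    """Exchange only the binding components of two activations, keeping content."""
    new_a, new_b = act_a.copy(), act_b.copy()
    new_a[BINDING], new_b[BINDING] = act_b[BINDING], act_a[BINDING]
    return new_a, new_b


# Two objects (O0, O1) and two items (I0, I1); O_k and I_k share binding ID k.
objects = [make_activation(0, 0), make_activation(1, 1)]
items = [make_activation(2, 0), make_activation(3, 1)]

# Unaltered activations: each object retrieves its own item.
before = [predict_item(obj, items) for obj in objects]

# Intervention: swap the binding components of the two items.
items_swapped = list(swap_binding(items[0], items[1]))
after = [predict_item(obj, items_swapped) for obj in objects]

print("before:", before)
print("after: ", after)
```

In this toy setting the content components of the items are untouched, yet the object-to-item predictions flip after the swap, which mirrors the qualitative pattern the factorizability experiments look for. A real VLM has no such clean axis-aligned split, so the sketch should be read as intuition only.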