Investigating Mechanisms for In-Context Vision Language Binding
Darshana Saravanan, Makarand Tapaswi, Vineet Gandhi
TL;DR
The paper investigates how Vision-Language models (VLMs) bind visual objects to their textual references. It introduces the Shapes task, a controlled setting in which the model must map the image patches of objects (O0, O1) and their attributes, colors (C0, C1) and items (I0, I1), to textual descriptions. It formalizes a Binding ID hypothesis that decomposes activations into content and binding components, and tests it with causal interventions and position-invariance checks. Results show distinct binding vectors associated with object, color, and item tokens that link image regions to text across modalities. Exchanging or perturbing these bindings alters or swaps the model's predictions, while color bindings often remain invariant, which is consistent with a separable binding subspace. These findings advance the interpretability of VLMs and suggest manipulable, cross-modal binding mechanisms that could matter for robust reasoning and safety-sensitive deployments.
Abstract
To understand a prompt, Vision-Language models (VLMs) must perceive the image, comprehend the text, and build associations within and across both modalities. For instance, given an 'image of a red toy car', the model should associate this image with phrases like 'car', 'red toy', 'red object', etc. Feng and Steinhardt propose the Binding ID mechanism in LLMs, suggesting that an entity and its corresponding attribute tokens share a Binding ID in the model activations. We investigate this for image-text binding in VLMs using a synthetic dataset and task that requires models to associate 3D objects in an image with their descriptions in the text. Our experiments demonstrate that VLMs assign a distinct Binding ID to an object's image tokens and its textual references, enabling in-context association.
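The Binding ID idea can be illustrated with a small toy sketch. It is not the paper's code and does not touch a real VLM. It assumes a purely additive model in which each activation is a content vector plus a binding vector shared by an entity and its attribute. All names here (`content`, `binding`, `lookup`, `swap_color_bindings`) and the activation width are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hypothetical activation width, chosen only for illustration

# Hypothetical content vectors: what each token represents.
names = ["obj0", "obj1", "color0", "color1"]
content = {n: rng.normal(size=d) for n in names}

# Hypothetical binding vectors: orthogonal, one per entity (an assumption of this toy).
binding = 3.0 * np.eye(d)[:2]

def build_activations():
    """Additive toy model: activation = content + binding vector of the entity."""
    return {
        "obj0": content["obj0"] + binding[0],
        "color0": content["color0"] + binding[0],
        "obj1": content["obj1"] + binding[1],
        "color1": content["color1"] + binding[1],
    }

def lookup(acts, obj):
    """Return the color token whose binding component best matches the object's."""
    obj_binding = acts[obj] - content[obj]
    scores = {c: obj_binding @ (acts[c] - content[c]) for c in ("color0", "color1")}
    return max(scores, key=scores.get)

def swap_color_bindings(acts):
    """Intervention: exchange the binding components of the two color tokens."""
    delta = binding[1] - binding[0]
    swapped = dict(acts)
    swapped["color0"] = acts["color0"] + delta
    swapped["color1"] = acts["color1"] - delta
    return swapped

acts = build_activations()
swapped = swap_color_bindings(acts)
print(lookup(acts, "obj0"), lookup(swapped, "obj0"))  # color0 color1
```

In this toy, swapping the binding components flips which color is retrieved for each object while the content vectors stay untouched. This mirrors, in a highly simplified form, the kind of causal intervention the hypothesis predicts should swap a model's answers.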
