The Geometry of Representational Failures in Vision Language Models

Daniele Savietto, Declan Campbell, André Panisson, Marco Nurisso, Giovanni Petri, Jonathan D. Cohen, Alan Perotti

TL;DR

The paper asks why vision-language models make binding-like errors in multi-object scenes and attributes them to geometric interference between concept representations in a shared latent space. It extracts concept vectors in two ways, supervised discrimination and centroid-based geometric distillation, and adds a PCA-regularized variant to enforce compositional structure. Activation steering provides the causal validation: reorienting internal representations changes what the model reports perceiving in both natural and synthetic tasks. Across three open-weight VLMs, the authors report geometric signatures common to all three models and connect the findings to the Curse of Generalization. The result is a quantitative framework linking internal geometry to external behavior, with implications for mechanistic interpretability and model design.

Abstract

Vision-Language Models (VLMs) exhibit puzzling failures in multi-object visual tasks, such as hallucinating non-existent elements or failing to identify the most similar objects among distractions. While these errors mirror human cognitive constraints, such as the "Binding Problem", the internal mechanisms driving them in artificial systems remain poorly understood. Here, we propose a mechanistic insight by analyzing the representational geometry of open-weight VLMs (Qwen, InternVL, Gemma), comparing methodologies to distill "concept vectors", latent directions encoding visual concepts. We validate our concept vectors via steering interventions that reliably manipulate model behavior in both simplified and naturalistic vision tasks (e.g., forcing the model to perceive a red flower as blue). We observe that the geometric overlap between these vectors strongly correlates with specific error patterns, offering a grounded quantitative framework to understand how internal representations shape model behavior and drive visual failures.
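
The abstract names two operations, distilling a concept vector and steering activations with it, without giving formulas. The NumPy sketch below is one plausible reading, not the authors' procedure. It assumes a centroid-based vector is the mean activation of a concept's examples minus the mean over all examples, and that steering moves each token's component along a source direction onto a target direction. The function names `centroid_concept_vector` and `steer`, the `alpha` strength parameter, and the array shapes are all hypothetical.

```python
import numpy as np


def centroid_concept_vector(acts, labels, concept):
    """Assumed definition: centroid of `concept` examples minus the global centroid.

    acts:   (n_examples, d) array of activations
    labels: length-n_examples sequence of concept labels
    """
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels)
    return acts[labels == concept].mean(axis=0) - acts.mean(axis=0)


def steer(hidden, source_vec, target_vec, alpha=1.0):
    """Assumed intervention: re-point the source component toward the target.

    hidden: (n_tokens, d) array of hidden states
    For each token, the projection onto the unit source direction is
    removed and the same magnitude is added along the unit target direction.
    """
    hidden = np.asarray(hidden, dtype=float)
    s = source_vec / np.linalg.norm(source_vec)
    t = target_vec / np.linalg.norm(target_vec)
    coeff = hidden @ s  # per-token projection onto the source direction
    return hidden + alpha * np.outer(coeff, t - s)
```

With `alpha=1.0` the source component is fully transferred to the target direction; smaller values interpolate. In a real model the `hidden` array would come from a forward hook at a chosen layer, which this sketch leaves out.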

Paper Structure

This paper contains 24 sections, 7 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: (a) Example of image used to train the attentive probes. (b) Example of image used to distill the concept vectors for "red square". In this case we also show the VLM tokenization grid.
  • Figure 2: Example of causal intervention on a natural image description task. The image, prompt and model weights are unchanged; we manipulated the model's activations in order to force the 'red' color to be perceived as blue.
  • Figure 3: Matrix of cosine similarities between 36 color-shape concept vectors (Gemma). The block structure reflects shared colors (large blocks) and shapes (sub-diagonals). Bottom: similarity distributions for object pairs sharing color, shape, or neither.
  • Figure 4: (a) Heatmap of the cosine similarities between centroid-based hue concept vectors found in the vision embeddings of Qwen. (b) Projection of color representations in the first 3 principal components. (c,d) Semantic Similarity Function $g_h(\Delta)$ for different hues ($h$ corresponds to the color of the curve). The black line represents the average function $g(\Delta)$.
  • Figure 5: Visual search task. The query reads "Is there a purple heart in the image? Answer YES or NO". (a) Target present with dissimilar distractors. (b) Target absent but distractors share features with target (purple star, blue heart), creating high interference.
  • ...and 5 more figures
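
The block structure described for Figure 3 can be reproduced on toy data. The sketch below does not use the paper's vectors. It assumes, purely for illustration, that each of the 36 color-shape vectors is the sum of a random color direction and a weaker random shape direction, and that the 36 objects form a 6 by 6 grid of colors and shapes. The helper names and the 0.5 shape scale are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities between the rows of `vectors`."""
    v = np.asarray(vectors, dtype=float)
    unit = v / np.linalg.norm(v, axis=1, keepdims=True)
    return unit @ unit.T


def group_pair_similarities(sim, colors, shapes):
    """Mean similarity over distinct pairs sharing color, shape, or neither."""
    groups = {"same_color": [], "same_shape": [], "neither": []}
    n = len(colors)
    for i in range(n):
        for j in range(i + 1, n):
            if colors[i] == colors[j]:
                groups["same_color"].append(sim[i, j])
            elif shapes[i] == shapes[j]:
                groups["same_shape"].append(sim[i, j])
            else:
                groups["neither"].append(sim[i, j])
    return {name: float(np.mean(vals)) for name, vals in groups.items()}


# Toy compositional vectors (assumption): color part plus a weaker shape part.
rng = np.random.default_rng(0)
n_colors, n_shapes, dim = 6, 6, 256
color_dirs = rng.normal(size=(n_colors, dim))
shape_dirs = 0.5 * rng.normal(size=(n_shapes, dim))

# Order objects color-major so shared colors form contiguous blocks.
colors = [c for c in range(n_colors) for _ in range(n_shapes)]
shapes = [s for _ in range(n_colors) for s in range(n_shapes)]
vectors = np.stack([color_dirs[c] + shape_dirs[s] for c, s in zip(colors, shapes)])

sim = cosine_similarity_matrix(vectors)
means = group_pair_similarities(sim, colors, shapes)
print(means)
```

Under these assumptions, ordering the objects color-major puts same-color pairs in diagonal blocks and same-shape pairs on sub-diagonals, so the three group means separate in the way the caption's bottom panel describes.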