The Geometry of Representational Failures in Vision Language Models
Daniele Savietto, Declan Campbell, André Panisson, Marco Nurisso, Giovanni Petri, Jonathan D. Cohen, Alan Perotti
TL;DR
The paper investigates why vision-language models (VLMs) make binding-like errors in multi-object scenes, positing geometric interference in a shared latent space as the root cause. It extracts concept vectors by two routes, supervised discrimination and centroid-based geometric distillation, and adds a PCA-regularized variant to enforce compositional structure. Activation steering provides causal validation: shifting internal representations along these vectors changes what the model reports perceiving in both natural and synthetic tasks. Across three open-weight VLMs, the authors report consistent geometric signatures and connect their findings to the Curse of Generalization. The result is a quantitative framework linking internal geometry to external behavior, with implications for mechanistic interpretability and model design.
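To make the centroid-based distillation and activation steering mentioned above more concrete, here is a minimal sketch on synthetic activations. It is not the authors' implementation: the "concept centroid minus centroid of everything else" definition, the steering rule `h + alpha * (target - source)`, the function names, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def centroid_concept_vector(acts, labels, concept):
    """Unit vector from the centroid of all other samples to the centroid of `concept`.

    Assumed definition for illustration; the paper's exact procedure may differ.
    """
    acts = np.asarray(acts, dtype=float)
    mask = np.asarray(labels) == concept
    v = acts[mask].mean(axis=0) - acts[~mask].mean(axis=0)
    return v / np.linalg.norm(v)

def steer(hidden, source_vec, target_vec, alpha=1.0):
    """Shift a hidden state away from the source concept and toward the target concept."""
    return hidden + alpha * (target_vec - source_vec)

# Synthetic "activations": three color concepts, each clustered around its own axis.
rng = np.random.default_rng(0)
dim, n_per_class = 16, 50
colors = ["red", "blue", "green"]
acts, labels = [], []
for i, color in enumerate(colors):
    center = np.zeros(dim)
    center[i] = 3.0
    acts.append(center + 0.1 * rng.standard_normal((n_per_class, dim)))
    labels += [color] * n_per_class
acts = np.vstack(acts)

v_red = centroid_concept_vector(acts, labels, "red")
v_blue = centroid_concept_vector(acts, labels, "blue")

h = acts[0]                                   # a "red" activation
h_steered = steer(h, v_red, v_blue, alpha=4.0)

# After steering, the activation projects more onto "blue" and less onto "red".
print(h @ v_blue, h_steered @ v_blue)
print(h @ v_red, h_steered @ v_red)
```

In a real VLM the steered hidden state would be written back into the forward pass at a chosen layer; that step depends on the model and is omitted here.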
Abstract
Vision-Language Models (VLMs) exhibit puzzling failures in multi-object visual tasks, such as hallucinating non-existent elements or failing to identify the most similar objects among distractors. While these errors mirror human cognitive constraints, such as the "Binding Problem", the internal mechanisms driving them in artificial systems remain poorly understood. Here, we offer mechanistic insight by analyzing the representational geometry of open-weight VLMs (Qwen, InternVL, Gemma) and comparing methodologies for distilling "concept vectors", latent directions that encode visual concepts. We validate our concept vectors via steering interventions that reliably manipulate model behavior in both simplified and naturalistic vision tasks (e.g., forcing the model to perceive a red flower as blue). We observe that the geometric overlap between these vectors strongly correlates with specific error patterns, offering a grounded quantitative framework for understanding how internal representations shape model behavior and drive visual failures.
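One way to read "geometric overlap" is as the pairwise cosine similarity between concept vectors, which can then be compared against how often the model confuses each pair of concepts. The sketch below assumes that reading. The metric, the function names, and the numbers are illustrative placeholders, not the paper's method or results.

```python
import numpy as np

def cosine_overlap(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    V = np.asarray(vectors, dtype=float)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    return V @ V.T

def overlap_error_correlation(overlap, confusion):
    """Pearson correlation between overlap and confusion over distinct concept pairs."""
    iu = np.triu_indices_from(overlap, k=1)  # upper triangle, diagonal excluded
    return np.corrcoef(overlap[iu], confusion[iu])[0, 1]

# Toy concept vectors: the first two overlap strongly, the third is orthogonal to both.
concept_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
])

# Made-up pairwise confusion rates, for illustration only (not measured data).
toy_confusion = np.array([
    [0.00, 0.30, 0.05],
    [0.30, 0.00, 0.04],
    [0.05, 0.04, 0.00],
])

overlap = cosine_overlap(concept_vectors)
r = overlap_error_correlation(overlap, toy_confusion)
print(overlap.round(2))
print(r)
```

Only the upper triangle enters the correlation so that each concept pair is counted once and the trivial self-similarity on the diagonal does not inflate the estimate.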
