If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions
Reza Esfandiarpoor, Cristina Menghini, Stephen H. Bach
TL;DR
The paper asks whether Vision-Language Models (VLMs) really represent concepts through visual attributes, or whether they lean on non-visual or even spurious textual cues. It introduces Extract and Explore (EX2), a framework that uses reinforcement learning to align a large language model with a VLM's preferences, so that the language model generates descriptions containing the features the VLM relies on to represent concepts. Analysis of these aligned descriptions shows that spurious descriptions and non-visual attributes (such as habitat) frequently influence VLM representations, that different VLMs prioritize different attributes, and that even the same VLM shifts emphasis across datasets. The work also presents EX2 as a tool for downstream hypothesis generation and broad VLM analysis, with implications for pre-training data design and for reducing reliance on non-visual cues in vision-language understanding.
Abstract
Recent works often assume that Vision-Language Model (VLM) representations are based on visual attributes like shape. However, it is unclear to what extent VLMs prioritize this information to represent concepts. We propose Extract and Explore (EX2), a novel approach to characterize textual features that are important for VLMs. EX2 uses reinforcement learning to align a large language model with VLM preferences and generates descriptions that incorporate features that are important for the VLM. Then, we inspect the descriptions to identify features that contribute to VLM representations. Using EX2, we find that spurious descriptions have a major role in VLM representations despite providing no helpful information, e.g., "Click to enlarge photo of CONCEPT". More importantly, among informative descriptions, VLMs rely significantly on non-visual attributes like habitat (e.g., North America) to represent visual concepts. Also, our analysis reveals that different VLMs prioritize different attributes in their representations. Overall, we show that VLMs do not simply match images to scene descriptions and that non-visual or even spurious descriptions significantly influence their representations.
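The idea of scoring descriptions by how much a VLM "prefers" them can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes, purely for illustration, that the preference signal is the mean cosine similarity between a description's text embedding and the image embeddings of a concept, and it uses hand-made vectors in place of real VLM outputs. The function names (`cosine`, `description_reward`, `rank_descriptions`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def description_reward(desc_embedding, image_embeddings):
    """ASSUMED reward: mean cosine similarity between one description
    embedding and all image embeddings of a concept."""
    if not image_embeddings:
        raise ValueError("need at least one image embedding")
    total = sum(cosine(desc_embedding, img) for img in image_embeddings)
    return total / len(image_embeddings)


def rank_descriptions(candidates, image_embeddings):
    """Return (reward, text) pairs sorted by reward, highest first.

    `candidates` is a list of (text, embedding) pairs.
    """
    scored = [
        (description_reward(emb, image_embeddings), text)
        for text, emb in candidates
    ]
    return sorted(scored, reverse=True)


# Toy, hand-made embeddings standing in for real VLM outputs.
images = [[1.0, 0.0], [1.0, 0.0]]
candidates = [
    ("a bird with red wings", [0.0, 1.0]),
    ("Click to enlarge photo of bird", [1.0, 0.0]),
]
ranked = rank_descriptions(candidates, images)
```

In this contrived setup the uninformative description ranks first only because its toy embedding was chosen to align with the images. A language model optimized against such a score would drift toward whatever text the VLM rewards, whether or not that text describes visual attributes, which is the kind of behaviour the abstract reports inspecting.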
