If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions

Reza Esfandiarpoor, Cristina Menghini, Stephen H. Bach

TL;DR

The paper examines whether Vision-Language Models (VLMs) represent concepts through visual attributes alone or also through non-visual and even spurious textual cues. It introduces Extract and Explore (EX2), a framework that uses reinforcement learning to align a large language model with VLM preferences, so that the model generates descriptions reflecting the features the VLM uses to represent concepts. Analysis of these aligned descriptions shows that spurious descriptions and non-visual attributes (such as habitat) significantly influence VLM representations, that different VLMs prioritize different attributes, and that even the same VLM shifts emphasis across datasets. The work demonstrates EX2’s value for downstream hypothesis generation and broad VLM analysis, with implications for pre-training data design and for reducing reliance on non-visual cues in vision-language understanding.

Abstract

Recent works often assume that Vision-Language Model (VLM) representations are based on visual attributes like shape. However, it is unclear to what extent VLMs prioritize this information to represent concepts. We propose Extract and Explore (EX2), a novel approach to characterize textual features that are important for VLMs. EX2 uses reinforcement learning to align a large language model with VLM preferences and generates descriptions that incorporate features that are important for the VLM. Then, we inspect the descriptions to identify features that contribute to VLM representations. Using EX2, we find that spurious descriptions play a major role in VLM representations despite providing no helpful information, e.g., "Click to enlarge photo of CONCEPT". More importantly, among informative descriptions, VLMs rely significantly on non-visual attributes like habitat (e.g., "North America") to represent visual concepts. Also, our analysis reveals that different VLMs prioritize different attributes in their representations. Overall, we show that VLMs do not simply match images to scene descriptions and that non-visual or even spurious descriptions significantly influence their representations.

Paper Structure

This paper contains 25 sections, 1 equation, 4 figures, and 20 tables.

Figures (4)

  • Figure 1: Extract: we align Mistral with VLM preferences and generate descriptions that contain features that are important for the VLM. Explore: we examine various aspects of these descriptions to identify features that contribute to VLM representations.
  • Figure 2: Extract and Explore (EX2) overview. A) We use RL to fine-tune an LLM to generate concept descriptions that are closer to the corresponding images in the VLM embedding space; as a result, the descriptions incorporate features that the VLM uses to represent the concepts. We use the aligned LLM to generate the VLM's preferred descriptions for all concepts. B) We inspect these descriptions from various aspects, e.g., whether they are informative or describe visual attributes. Based on the aggregate results, we draw conclusions about how the VLM represents concepts.
  • Figure 3: Breakdown of aligned descriptions for CLIP on Flowers. CLIP significantly relies on spurious or non-visual information to represent flower species.
  • Figure 4: Most commonly described attributes for CLIP and ALIGN on CUB and Flowers. Different VLMs prioritize different attributes to represent concepts. Even the same VLM prioritizes different attributes across datasets.
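
The Figure 2 caption says the LLM is fine-tuned so that its descriptions land closer to the corresponding images in the VLM embedding space. One way to picture such a reward signal is sketched below. This is only an illustration under stated assumptions: the excerpt does not give the paper's reward formula, so the mean cosine similarity, the function names `cosine_similarity` and `description_reward`, and the toy 3-dimensional vectors (stand-ins for real VLM text and image embeddings) are all hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def description_reward(text_emb: Vector, image_embs: Sequence[Vector]) -> float:
    """Hypothetical reward (an assumption, not the paper's formula):
    mean cosine similarity between one description embedding and the
    embeddings of the images that belong to the same concept."""
    if not image_embs:
        raise ValueError("need at least one image embedding")
    total = sum(cosine_similarity(text_emb, img) for img in image_embs)
    return total / len(image_embs)


# Toy embeddings standing in for VLM outputs (illustrative values only).
images = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]]
close_desc = [1.0, 0.1, 0.0]  # points roughly the same way as the images
far_desc = [0.0, 0.0, 1.0]    # orthogonal to both images

# A description nearer to the images receives the larger reward.
print(description_reward(close_desc, images) > description_reward(far_desc, images))
```

In an RL setup of the kind the caption describes, a score like this would be computed for each generated description and used as the reward for the LLM policy; the choice of RL algorithm and any normalization of the reward are not specified in this excerpt.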