Color in Visual-Language Models: CLIP deficiencies

Guillem Arias, Ramon Baldrich, Maria Vanrell

TL;DR

This work investigates how color is encoded in CLIP, revealing two major deficiencies: a bias against achromatic colors and a strong tendency to rely on textual cues over perceptual color. Through synthetic color experiments and Stroop-style tests, the study shows CLIP's color labeling is reliable for chromatic cues but falters with achromatic stimuli and color perception tasks involving text, indicating a reading-dominant bias. Neuron-level analyses introduce a Color-Label Selectivity Index and identify color multi-modal neurons in shallow layers, suggesting color concepts are distributed across the network and influenced by training data. The findings highlight the need to refine color representation mechanisms in multimodal models to achieve more human-like color understanding and robustness in real-world scenarios.

Abstract

This work explores how color is encoded in CLIP (Contrastive Language-Image Pre-training), currently the most influential VLM (Visual Language Model) in Artificial Intelligence. After performing experiments on synthetic datasets created for this task, we conclude that CLIP is able to attribute correct color labels to colored visual stimuli, but we identify two main deficiencies: (a) a clear bias against achromatic stimuli, which are poorly related to the color concept, so that white, gray and black are rarely assigned as color labels; and (b) a tendency to prioritize text over other visual information, which we show to be highly significant in color labelling through an exhaustive Stroop-effect test. To find the causes of these color deficiencies, we analyse the internal representation at the neuron level. We conclude that CLIP has a substantial number of neurons selective to text, especially in the deepest layers of the network, and a smaller number of multi-modal color neurons, which could be the key to understanding the concept of color properly. Our investigation underscores the need to refine color representation mechanisms in neural networks to foster a fuller comprehension of colors as humans understand them, thereby advancing the efficacy and versatility of multimodal models like CLIP in real-world scenarios.

Paper Structure

This paper contains 11 sections, 1 equation, 3 figures, 5 tables.

Figures (3)

  • Figure 1: CLIP architecture set up for a color naming task. The input image, passed through the visual encoder, is contrasted with several color labels inserted into the input text. The output is the label that maximizes the similarity between the visual and text embeddings (Example: Input Text is "The background is { color }" and Input Image is a green triangle with a pink background).
  • Figure 2: Distribution of Color Selective Neurons in CLIP Visual Encoder Layers.
  • Figure 3: Hue Selectivity Distribution in the Visual Encoder vs. ImageNet Distribution (Pearson's correlation coefficient R=0.965).
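
The color naming setup described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes the image and each filled-in prompt have already been encoded into vectors; the 3-dimensional embeddings below are hypothetical toy values, not outputs of CLIP, and the helper names `cosine` and `name_color` are illustrative rather than taken from the paper. The sketch only shows the selection rule: choose the color label whose prompt embedding is most similar to the image embedding.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def name_color(image_embedding, text_embeddings):
    """Return the best-matching color label and the per-label similarity scores."""
    scores = {
        label: cosine(image_embedding, embedding)
        for label, embedding in text_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Prompt template from the Figure 1 caption.
PROMPT = "The background is {color}"

# Hypothetical toy embeddings standing in for the text encoder's output
# on the prompt filled in with each color label (assumption, not real data).
toy_text_embeddings = {
    "pink": [0.9, 0.1, 0.3],
    "green": [0.1, 0.9, 0.2],
    "gray": [0.5, 0.5, 0.5],
}

# Hypothetical toy embedding standing in for the visual encoder's output
# on an image with a pink background (assumption, not real data).
toy_image_embedding = [0.8, 0.2, 0.35]

label, scores = name_color(toy_image_embedding, toy_text_embeddings)
print(PROMPT.format(color=label))
```

With real encoders, the toy vectors would be replaced by the visual and text embeddings of the actual image and prompts; the selection step stays the same.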