ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness

Yijun Liang, Ming Li, Chenrui Fan, Ziyue Li, Dang Nguyen, Kwesi Cobbina, Shweta Bhardwaj, Jiuhai Chen, Fuxiao Liu, Tianyi Zhou

TL;DR

ColorBench addresses whether vision-language models genuinely understand color by introducing a dedicated, three-dimensional benchmark (Color Perception, Color Reasoning, Color Robustness) with 11 tasks and over 1.4k instances. It evaluates 32 VLMs across diverse architectures, examining how color cues influence perception and higher-level reasoning, and how robust models are to color transformations, including recolorings. Key findings show that the color-understanding scaling law exists but is modest and largely driven by language-model size, that absolute color performance is challenging with small inter-model gaps, and that Chain-of-Thought prompting can improve color-related accuracy and robustness while color cues can sometimes mislead. The work highlights critical gaps in current VLMs’ color comprehension and provides a foundation for future architectures and training strategies to achieve human-like color understanding in multimodal AI.

Abstract

Color plays an important role in human perception and usually provides critical clues in visual reasoning. However, it is unclear whether and how vision-language models (VLMs) can perceive, understand, and leverage color as humans do. This paper introduces ColorBench, an innovative benchmark meticulously crafted to assess the capabilities of VLMs in color understanding, including color perception, reasoning, and robustness. By curating a suite of diverse test scenarios grounded in real applications, ColorBench evaluates how these models perceive colors, infer meanings from color-based cues, and maintain consistent performance under varying color transformations. Through an extensive evaluation of 32 VLMs with varying language models and vision encoders, our paper reveals several previously unreported findings: (i) The scaling law (larger models are better) still holds on ColorBench, while the language model plays a more important role than the vision encoder. (ii) However, the performance gaps across models are relatively small, indicating that color understanding has been largely neglected by existing VLMs. (iii) CoT reasoning improves color understanding accuracy and robustness, even though these are vision-centric tasks. (iv) Color clues are indeed leveraged by VLMs on ColorBench, but they can also mislead models in some tasks. These findings highlight the critical limitations of current VLMs and underscore the need to enhance color comprehension. Our ColorBench can serve as a foundational tool for advancing the study of human-level color understanding in multimodal AI.

Paper Structure

This paper contains 42 sections, 2 equations, 91 figures, 11 tables.

Figures (91)

  • Figure 1: Test samples from ColorBench. ColorBench evaluates VLMs across three core capabilities: Perception, Reasoning, and Robustness. The benchmark comprises 11 tasks designed to assess fine-grained color understanding abilities and the effect of color on other reasoning skills, including counting, proportion calculation, and robustness estimation. With over 1,400 instances, ColorBench covers a wide range of real-world application scenarios, including painting analysis, test kit readings, shopping, satellite/wildlife image analysis, etc.
  • Figure 2: Statistics of 3 categories and 11 tasks in ColorBench.
  • Figure 3: Generation Pipeline for Color Robustness. For each seed image, we apply $3$ recoloring strategies (Entire Image, Target Segment, Largest Segment) to generate edited images. For each strategy, we change the color of the recoloring region via shifting the Hue values by $90$°, $180$°, or $270$° in HSV color space.
  • Figure 4: Heatmaps relating performance to VLM size. Deeper color represents higher P&R Overall Accuracy or Robustness. Each line represents a model family, with sizes growing from small to large. The visualization shows the correlation between performance and model size: larger models achieve higher performance.
  • Figure 5: The percentage change in accuracy (y-axis) when colorful images are converted to grayscale, for each ColorBench task (x-axis). Each violin plot visualizes the distribution over all VLMs. A higher (lower) percentage indicates that VLMs rely more (less) on color clues for the task. A positive (negative) percentage indicates degradation (improvement) on grayscale images. Color clues are leveraged by VLMs to some degree in most tasks, but they can mislead VLMs (illusion & mimicry).
  • ...and 86 more figures
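
The recoloring step described in the Figure 3 caption (shifting the hue of a chosen region by 90°, 180°, or 270° in HSV space) can be sketched as below. This is only an illustrative sketch using Python's standard `colorsys` module, not the authors' pipeline: the function names `shift_hue` and `recolor_region`, the nested-list image representation, and the boolean mask are all assumptions made for this example. How the Entire Image, Target Segment, or Largest Segment region is selected is not shown here.

```python
import colorsys


def shift_hue(rgb, degrees):
    """Rotate the hue of one RGB pixel (ints in 0..255) by `degrees` in HSV space.

    Hypothetical helper for illustration; saturation and value are left unchanged.
    """
    r, g, b = (c / 255.0 for c in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    # colorsys stores hue in [0, 1), so a shift in degrees is divided by 360.
    h = (h + degrees / 360.0) % 1.0
    return tuple(round(c * 255) for c in colorsys.hsv_to_rgb(h, s, v))


def recolor_region(pixels, mask, degrees):
    """Apply the hue shift only where `mask` is True.

    `pixels` is a list of rows of (r, g, b) tuples and `mask` is a same-shaped
    list of rows of booleans (assumed representation). An all-True mask
    corresponds to recoloring the entire image.
    """
    return [
        [shift_hue(p, degrees) if m else p for p, m in zip(row, mask_row)]
        for row, mask_row in zip(pixels, mask)
    ]


if __name__ == "__main__":
    image = [[(255, 0, 0), (255, 0, 0)]]   # two red pixels
    mask = [[True, False]]                 # recolor only the first pixel
    for angle in (90, 180, 270):
        print(angle, recolor_region(image, mask, angle))
```

Because only the hue channel changes, achromatic pixels (saturation 0) are unaffected by the shift, and a 360° shift returns the original color.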