ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
Yijun Liang, Ming Li, Chenrui Fan, Ziyue Li, Dang Nguyen, Kwesi Cobbina, Shweta Bhardwaj, Jiuhai Chen, Fuxiao Liu, Tianyi Zhou
TL;DR
ColorBench addresses whether vision-language models genuinely understand color by introducing a dedicated, three-dimensional benchmark (Color Perception, Color Reasoning, Color Robustness) with 11 tasks and over 1.4k instances. It evaluates 32 VLMs across diverse architectures, examining how color cues influence perception and higher-level reasoning, and how robust models are to color transformations, including recolorings. Key findings show that the color-understanding scaling law exists but is modest and largely driven by language-model size, that absolute color performance is challenging with small inter-model gaps, and that Chain-of-Thought prompting can improve color-related accuracy and robustness while color cues can sometimes mislead. The work highlights critical gaps in current VLMs’ color comprehension and provides a foundation for future architectures and training strategies to achieve human-like color understanding in multimodal AI.
Abstract
Color plays an important role in human perception and usually provides critical clues in visual reasoning. However, it is unclear whether and how vision-language models (VLMs) can perceive, understand, and leverage color as humans do. This paper introduces ColorBench, a benchmark designed to assess the capabilities of VLMs in color understanding, including color perception, reasoning, and robustness. By curating a suite of diverse test scenarios grounded in real applications, ColorBench evaluates how these models perceive colors, infer meanings from color-based cues, and maintain consistent performance under varying color transformations. Through an extensive evaluation of 32 VLMs with varying language models and vision encoders, our paper reveals several previously unreported findings: (i) The scaling law (larger models are better) still holds on ColorBench, while the language model plays a more important role than the vision encoder. (ii) However, the performance gaps across models are relatively small, indicating that color understanding has been largely neglected by existing VLMs. (iii) Chain-of-Thought (CoT) reasoning improves both accuracy and robustness in color understanding, even though these are vision-centric tasks. (iv) Color clues are indeed leveraged by VLMs on ColorBench, but they can also mislead models in some tasks. These findings highlight the critical limitations of current VLMs and underscore the need to enhance color comprehension. ColorBench can serve as a foundational tool for advancing the study of human-level color understanding in multimodal AI.
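To make the robustness dimension concrete, the following is a minimal sketch of what "consistent performance under color transformations" could mean in code. It is not the benchmark's actual pipeline: the hue-rotation recoloring, the function names `rotate_hue` and `consistency_rate`, and the rule that an item counts as stable only if every recolored variant yields the same answer as the original are all assumptions made for illustration.

```python
import colorsys


def rotate_hue(rgb, degrees):
    """Rotate the hue of one (r, g, b) pixel with 0-255 channels.

    Hypothetical recoloring used for illustration only.
    """
    r, g, b = (c / 255.0 for c in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    h = (h + degrees / 360.0) % 1.0  # hue is stored in [0, 1)
    return tuple(round(c * 255) for c in colorsys.hsv_to_rgb(h, s, v))


def consistency_rate(original_answers, recolored_answers):
    """Fraction of items whose answer is unchanged on every recolored variant.

    original_answers: list of model answers on the original images.
    recolored_answers: list of lists, one inner list of answers per item,
        covering that item's recolored variants (assumed data layout).
    """
    if not original_answers:
        raise ValueError("no items to score")
    if len(original_answers) != len(recolored_answers):
        raise ValueError("answer lists must have the same length")
    stable = sum(
        all(answer == original for answer in variants)
        for original, variants in zip(original_answers, recolored_answers)
    )
    return stable / len(original_answers)
```

Under these assumptions, a model that answers identically on an image and all of its recolored versions contributes to the consistency rate, while a single changed answer marks the item as unstable. A stricter or looser rule (for example, also requiring the answer to be correct) would be an equally reasonable design choice.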
