ColorFoil: Investigating Color Blindness in Large Vision and Language Models
Ahnaf Mozib Samin, M. Firoz Ahmed, Md. Mushtaq Shahriyar Rafee
TL;DR
ColorFoil introduces a foil-based benchmark to probe color perception in large Vision-Language models under zero-shot evaluation, using color word substitutions in captions derived from MS COCO and Flickr30k. The method assesses whether models assign higher likelihood to the original caption over foil captions when conditioned on the corresponding image. Results show ViLT and BridgeTower outperform CLIP-based variants and GroupViT in color discrimination, highlighting architecture-specific robustness in visual-language alignment. The work provides a quantitative, cross-dataset view of color-perception capabilities and motivates further robustness enhancements for real-world applicability.
Abstract
Built on the Transformer architecture, large Vision and Language (V&L) models have shown promising performance even in zero-shot settings. Several studies, however, indicate that these models lack robustness when dealing with complex linguistic and visual attributes. In this work, we introduce a novel V&L benchmark, ColorFoil, which uses color-related foils to assess the models' ability to perceive colors such as red, white, and green. We evaluate seven state-of-the-art V&L models, including CLIP, ViLT, GroupViT, and BridgeTower, in a zero-shot setting and report their results. The experimental evaluation indicates that ViLT and BridgeTower demonstrate much better color perception than CLIP and its variants and GroupViT. Moreover, CLIP-based models and GroupViT struggle to distinguish colors that are visually distinct to humans with normal color perception.
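To make the foil-based protocol concrete, the following is a minimal sketch of how color foils might be generated and scored. It is not the authors' implementation. The color list `COLORS`, the helper names, the choice to substitute only the first color word, and the rule that the original caption must outscore every foil are all assumptions made for illustration. `score_fn` is a hypothetical placeholder for any image-text matching score a V&L model provides.

```python
import re

# Hypothetical color vocabulary; the benchmark's actual list is not given here.
COLORS = ["red", "white", "green", "blue", "black", "yellow"]
COLOR_RE = re.compile(r"\b(" + "|".join(COLORS) + r")\b", re.IGNORECASE)


def make_foils(caption):
    """Build foil captions by swapping the first color word for each other color.

    Returns an empty list when the caption contains no color word.
    """
    match = COLOR_RE.search(caption)
    if match is None:
        return []
    original = match.group(0).lower()
    prefix, suffix = caption[:match.start()], caption[match.end():]
    return [prefix + color + suffix for color in COLORS if color != original]


def prefers_original(score_fn, image, caption, foils):
    """True if the original caption scores strictly higher than every foil.

    Assumption: a sample counts as correct only when all foils are beaten.
    """
    original_score = score_fn(image, caption)
    return all(original_score > score_fn(image, foil) for foil in foils)


def colorfoil_accuracy(score_fn, samples):
    """Fraction of (image, caption) pairs where the original caption wins.

    Captions without a color word yield no foils and are skipped.
    """
    correct = 0
    total = 0
    for image, caption in samples:
        foils = make_foils(caption)
        if not foils:
            continue
        total += 1
        correct += prefers_original(score_fn, image, caption, foils)
    return correct / total if total else 0.0
```

In practice, `score_fn` would wrap a model's zero-shot image-text similarity or matching score, and `samples` would come from caption datasets such as MS COCO or Flickr30k. The scoring function is deliberately left abstract because each model family exposes a different interface.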
