What Color Is It? A Text-Interference Multimodal Hallucination Benchmark
Jinkun Zhao, Lei Huang, Haixin Ge, Wenjun Wu
TL;DR
This paper identifies a critical gap in color perception for multimodal models by showing that semantic interference embedded in images can drive color hallucinations. It introduces the What Color Is It benchmark with color, simple, and mask subsets to trigger single-modality visual hallucinations and quantify two interference types. Across 12 MLLMs, the study reveals that models often rely on textual cues within images rather than the actual visual color, with masking mitigating but not eliminating interference. The findings highlight the need for stronger visual grounding and diagnostic tools to improve color perception robustness in multimodal systems.
Abstract
With the rapid advancement of Large Models, numerous text-and-vision-fused Multimodal Large Models (MLMs) have emerged. However, these MLMs remain susceptible to informational interference in visual perception, particularly in color perception, which introduces an additional risk of hallucination. To validate this hypothesis, we introduce the "What Color Is It" dataset, a novel benchmark constructed using a simple method to trigger single-modality visual hallucination in MLMs. Based on this dataset, we further investigate the underlying causes of hallucination in the visual modality of MLMs and propose potential solutions to enhance their robustness.
