What Color Is It? A Text-Interference Multimodal Hallucination Benchmark

Jinkun Zhao; Lei Huang; Haixin Ge; Wenjun Wu

TL;DR

This paper identifies a critical gap in color perception for multimodal models by showing that semantic interference embedded in images can drive color hallucinations. It introduces the What Color Is It benchmark, whose color, simple, and mask subsets are designed to trigger single-modality visual hallucinations and to quantify two types of interference. Across 12 MLLMs, the study finds that models often rely on textual cues within images rather than on the actual visual color, and that masking mitigates but does not eliminate the interference. The findings highlight the need for stronger visual grounding and for diagnostic tools that improve the robustness of color perception in multimodal systems.

Abstract

With the rapid advancement of Large Models, numerous text-and-vision-fused Multimodal Large Models (MLMs) have emerged. However, these MLMs remain susceptible to informational interference in visual perception, particularly in color perception, which introduces an additional risk of hallucination. To validate this hypothesis, we introduce the "What Color Is It" dataset, a novel benchmark constructed using a simple method to trigger single-modality visual hallucination in MLMs. Based on this dataset, we further investigate the underlying causes of hallucination in the visual modality of MLMs and propose potential solutions to enhance their robustness.
