What Color Is It? A Text-Interference Multimodal Hallucination Benchmark

Jinkun Zhao, Lei Huang, Haixin Ge, Wenjun Wu

TL;DR

This paper identifies a critical gap in the color perception of multimodal models: semantic interference embedded in an image can drive color hallucinations. It introduces the "What Color Is It" benchmark, whose Color, Simple, and Mask subsets are designed to trigger single-modality visual hallucinations and to quantify two types of interference. Across 12 multimodal large language models (MLLMs), the study finds that models often rely on textual cues within the image rather than on the actual visual color, and that masking mitigates this interference without eliminating it. The findings point to a need for stronger visual grounding and for diagnostic tools that make color perception in multimodal systems more robust.

Abstract

With the rapid advancement of Large Models, numerous text-and-vision-fused Multimodal Large Models (MLMs) have emerged. However, these MLMs remain susceptible to informational interference in visual perception, particularly in color perception, which introduces an additional risk of hallucination. To validate this hypothesis, we introduce the "What Color Is It" dataset, a novel benchmark constructed using a simple method to trigger single-modality visual hallucination in MLMs. Based on this dataset, we further investigate the underlying causes of hallucination in the visual modality of MLMs and propose potential solutions to enhance their robustness.

Paper Structure

This paper contains 19 sections, 3 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Basic Structure of Samples in the "What Color Is It" Dataset. Each sample consists of three core components (background, question, and text), of which the text is the key element. A multimodal model must identify the visual color of the text, as directed by the question rendered in the image. The text can consist of color-irrelevant words or of color-related words (which serve as distractors), while the background and question can be black-and-white or colored (as distractors) as needed.
  • Figure 2: Sample Image Examples of the "Color" Subset. Type 1 modifies only the color of the text, with the background and question remaining black-and-white; Type 2 alters the colors of both the text and the question, while the background stays white; Type 3 changes the color of each character in the text, with the background and question kept black-and-white; Type 4 changes the colors of two words in the text and the question, while the background remains white.
  • Figure 3: Sample Image Examples of the "Simple" Subset. Each type is colored in the same way as in the "Color" Subset, except that the color-related words in the text are replaced with words unrelated to color.
  • Figure 4: Sample Image Examples of the "Mask" Subset. Each type is colored in the same way as in the "Color" Subset, except that some characters in the text are randomly masked with "*".
  • Figure 5: Test Cases of the GLMv4-Thinking Model on the Color and Simple Subsets. Red parts indicate hallucinated reasoning, while green parts indicate correct reasoning.
  • ...and 7 more figures
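
The sample layouts described in the captions of Figures 2 to 4 can be sketched as a small specification generator. This is a minimal illustration under stated assumptions, not the authors' generation code: the palette, the word lists, the `Sample` fields, the helper names `mask_text` and `make_sample`, the 50% mask ratio, and the choice of an ink color that conflicts with the word are all hypothetical. Only Type 1 and Type 3 are covered, and the sketch produces a description of each sample rather than a rendered image.

```python
import random
from dataclasses import dataclass

# Hypothetical palette and word lists; the paper's actual choices are not given here.
PALETTE = ["red", "green", "blue", "yellow"]
COLOR_WORDS = ["red", "green", "blue", "yellow"]
NEUTRAL_WORDS = ["table", "river", "window", "garden"]


@dataclass
class Sample:
    subset: str          # "color", "simple", or "mask"
    text: str            # word as it would be rendered in the image
    char_colors: list    # one color name per character of `text`
    question_color: str  # "black" for Types 1 and 3
    background: str      # "white" for Types 1 and 3


def mask_text(word, rng, ratio=0.5):
    """Replace a random subset of characters with '*' (at least one).

    The ratio is an assumption; the caption only says masking is random.
    """
    k = max(1, int(len(word) * ratio))
    hidden = set(rng.sample(range(len(word)), k))
    return "".join("*" if i in hidden else ch for i, ch in enumerate(word))


def make_sample(subset, sample_type, rng):
    """Build a spec for Type 1 (uniform text color) or Type 3 (per-character colors)."""
    words = NEUTRAL_WORDS if subset == "simple" else COLOR_WORDS
    word = rng.choice(words)
    if sample_type == 1:
        # Assumption: choose an ink color that differs from the word itself,
        # so a color word acts as a distractor.
        ink = rng.choice([c for c in PALETTE if c != word])
        colors = [ink] * len(word)
    elif sample_type == 3:
        colors = [rng.choice(PALETTE) for _ in word]
    else:
        raise ValueError("only Types 1 and 3 are sketched here")
    text = mask_text(word, rng) if subset == "mask" else word
    return Sample(subset, text, colors, "black", "white")


if __name__ == "__main__":
    rng = random.Random(0)
    for subset in ("color", "simple", "mask"):
        print(make_sample(subset, 1, rng))
```

Keeping the specification separate from rendering means the ground-truth color of every character is recorded alongside the sample, so a model's answer can be compared against `char_colors` directly.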