MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs

Wenqian Ye; Bohan Liu; Guangtao Zheng; Di Wang; Yunsheng Ma; Xu Cao; Bolin Lai; James M. Rehg; Aidong Zhang

TL;DR

This work defines and quantifies spurious biases in multimodal LLMs, arguing that vision-derived spurious attributes can distort vision–language alignment. It introduces MM-SpuBench, a VQA benchmark with 10,773 images and 2,400 QAs across nine bias types, enriched with core/spurious attribute annotations to explicitly separate essential and non-essential cues. Through extensive experiments on open- and closed-source MLLMs, the paper shows persistent reliance on spurious correlations, with robustness improving alongside model size and alignment quality, and with concept-informed, reasoning-enabled strategies providing notable mitigation. The benchmark and analysis highlight the need for stronger cross-modal alignment techniques and more robust architectures to advance trustworthy multimodal AI systems.

Abstract

Spurious bias, a tendency to use spurious correlations between non-essential input attributes and target variables for predictions, has revealed a severe robustness pitfall in deep learning models trained on single-modality data. Multimodal Large Language Models (MLLMs), which integrate both vision and language models, have demonstrated strong capability in joint vision-language understanding. However, whether spurious biases are prevalent in MLLMs remains under-explored. We address this gap by analyzing spurious biases in a multimodal setting, uncovering the specific test data patterns that can manifest this problem when biases in the vision model cascade into the alignment between visual and text tokens in MLLMs. To better understand this problem, we introduce MM-SpuBench, a comprehensive visual question-answering (VQA) benchmark designed to evaluate MLLMs' reliance on nine distinct categories of spurious correlations from five open-source image datasets. The VQA dataset is built from human-understandable concept information (attributes). Leveraging this benchmark, we conduct a thorough evaluation of current state-of-the-art MLLMs. Our findings show that these models persistently rely on spurious correlations and underscore the urgent need for new methodologies to mitigate spurious biases. To support research on MLLM robustness, we release our VQA benchmark at https://huggingface.co/datasets/mmbench/MM-SpuBench.

Paper Structure (44 sections, 1 theorem, 6 equations, 5 figures, 7 tables)

Introduction
Related Works
Robustness in multimodal LLMs.
Spurious attribute detection.
Benchmarks on multimodal LLMs.
Spurious Biases in Multimodal LLMs
Problem Setting
From Single Modality to Multi-modality
How to Reveal Multimodal Spurious Bias
The Multimodal Spurious Benchmark (MM-SpuBench)
Types of Spurious Correlations
Construction of MM-SpuBench
Image pre-selection.
Type identification and attribute extraction.
Visual Question Answering (VQA) generation.
...and 29 more sections

Key Result

Proposition 3.1

Given that the vision and language modalities are weakly correlated and that the conditional distributions in the two modalities satisfy the relations given in the paper, the inequality in Eq. (eq:multimodal) holds.

Figures (5)

Figure 1: Comparative performance of different MLLMs across 9 types of spurious biases in MM-SpuBench.
Figure 2: Illustration of multimodal spurious bias. From their training data, MLLMs aim to learn visual grounding through instance-level correlations between visual objects and text descriptions. During inference, these correlations can be influenced by other attributes that refer to the same object. In this case, we break the previously seen correlations and reveal the underlying spurious correlations learned by the model, which can be observed as failures to accurately interpret the objects in the vision modality.
Figure 3: Construction of MM-SpuBench. Left: Pre-select images where CLIP's prediction for the true class is not in the top $k$ but is in the top $l$. Middle: Use GPT-4V to identify spurious correlations and list core/spurious attributes. Right: Generate multiple-choice questions based on the spurious bias type and core/spurious attributes.
Figure 4: Overview of the MM-SpuBench. (a) Distribution of spurious correlation types. (b) Selected attributes within each spurious correlation type. Note that there might be shared attributes in different types since each image may contain at most two types of spurious correlations.
Figure 5: Examples of ground truth and misclassified labels, illustrating spurious correlations.
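
The image pre-selection rule in the Figure 3 caption (true class outside the top $k$ but inside the top $l$) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `preselect`, the NumPy array layout, and the example values of $k$ and $l$ are assumptions made for illustration, and the scores stand in for whatever per-class scores the classifier (for example, CLIP zero-shot similarities) produces.

```python
import numpy as np


def preselect(scores, labels, k, l):
    """Return indices of images whose true class ranks outside the
    top-k but inside the top-l predictions.

    scores: (n_images, n_classes) array of per-class scores (assumed layout).
    labels: (n_images,) array of true class indices.
    k, l:   hypothetical thresholds with 0 < k < l <= n_classes.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    if not 0 < k < l <= scores.shape[1]:
        raise ValueError("require 0 < k < l <= number of classes")

    # Score assigned to the true class of each image.
    true_scores = scores[np.arange(len(labels)), labels]
    # Zero-based rank of the true class: how many classes score strictly higher.
    ranks = (scores > true_scores[:, None]).sum(axis=1)
    # Keep images the classifier gets wrong at top-k but still ranks within top-l.
    return np.flatnonzero((ranks >= k) & (ranks < l))


if __name__ == "__main__":
    example_scores = np.array([
        [0.90, 0.05, 0.03, 0.02],  # true class 0 is top-1: not selected
        [0.10, 0.60, 0.20, 0.10],  # true class 2 is ranked 2nd: selected
        [0.40, 0.30, 0.20, 0.10],  # true class 3 is ranked 4th: not selected
    ])
    print(preselect(example_scores, [0, 2, 3], k=1, l=3))
```

Under these assumptions, such a filter keeps images that the classifier misclassifies but does not rank the true class arbitrarily low, which is one plausible reading of why both thresholds appear in the caption.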

Theorems & Definitions (2)

Definition 3.1: Multimodal Spurious Bias
Proposition 3.1
