MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

Huiyi Chen, Jiawei Peng, Dehai Min, Changchang Sun, Kaijie Chen, Yan Yan, Xu Yang, Lu Cheng

TL;DR

MVI-Bench introduces a comprehensive benchmark to evaluate LVLM robustness against misleading visual inputs using a three-level Visual Concept/Attribute/Relationship taxonomy and a paired dataset of 1,248 VQA instances (624 pairs). The framework pairs normal and misleading images to isolate visual cues and defines MVI-Sensitivity to quantify relative performance degradation. Across 18 LVLMs, results reveal pronounced vulnerabilities, especially in visual perception and spatial reasoning, and demonstrate that improvements in perception (via caption-assisted inference) or reasoning (via scaling or CoT) yield mixed, often non-monotonic gains. The study provides actionable insights—perception is the primary bottleneck, spurious correlations exist in current VQA training, and attention-guided analyses can diagnose reliance on misleading cues—advancing the development of more robust, reliable LVLMs; code and data are publicly available.
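The paired design and the "relative performance degradation" reading of MVI-Sensitivity can be illustrated with a small sketch. The paper's exact formula is not reproduced in this summary, so the form below (accuracy drop from normal to misleading images, divided by normal accuracy) is an assumption, and the function names `accuracy` and `relative_degradation` and the toy predictions are hypothetical.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of multiple-choice predictions that match the ground truth."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and aligned")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def relative_degradation(acc_normal: float, acc_misleading: float) -> float:
    """Assumed form of a sensitivity score: relative accuracy drop.

    This is NOT confirmed to be the paper's definition of MVI-Sensitivity;
    it is one common way to express relative performance degradation.
    """
    if acc_normal <= 0:
        raise ValueError("accuracy on normal images must be positive")
    return (acc_normal - acc_misleading) / acc_normal


# Toy data: each pair shares one question and one ground-truth answer,
# and the model is queried once per image in the pair.
answers = ["A", "B", "C", "D"]
preds_normal = ["A", "B", "C", "A"]      # 3 of 4 correct on normal images
preds_misleading = ["A", "B", "D", "A"]  # 2 of 4 correct on misleading images

score = relative_degradation(
    accuracy(preds_normal, answers),
    accuracy(preds_misleading, answers),
)
print(f"relative degradation: {score:.3f}")
```

Under this assumed form, a score of 0 means the misleading images cause no accuracy loss, and larger values mean a larger share of the normal-image accuracy is lost.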

Abstract

Evaluating the robustness of Large Vision-Language Models (LVLMs) is essential for their continued development and responsible deployment in real-world applications. However, existing robustness benchmarks typically focus on hallucination or misleading textual inputs, while largely overlooking the equally critical challenge posed by misleading visual inputs in assessing visual understanding. To fill this important gap, we introduce MVI-Bench, the first comprehensive benchmark specially designed for evaluating how Misleading Visual Inputs undermine the robustness of LVLMs. Grounded in fundamental visual primitives, the design of MVI-Bench centers on three hierarchical levels of misleading visual inputs: Visual Concept, Visual Attribute, and Visual Relationship. Using this taxonomy, we curate six representative categories and compile 1,248 expertly annotated VQA instances. To facilitate fine-grained robustness evaluation, we further introduce MVI-Sensitivity, a novel metric that characterizes LVLM robustness at a granular level. Empirical results across 18 state-of-the-art LVLMs uncover pronounced vulnerabilities to misleading visual inputs, and our in-depth analyses on MVI-Bench provide actionable insights that can guide the development of more reliable and robust LVLMs. The benchmark and codebase can be accessed at https://github.com/chenyil6/MVI-Bench.

Paper Structure

This paper contains 23 sections, 2 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: (a) Misleading Textual Input: misleading questions are created by injecting inaccurate or irrelevant information into otherwise normal queries. (b) Misleading Visual Input: misleading visual cues arise from real-world scenes, causing models to misinterpret the image content (e.g., stools mistaken for mushrooms).
  • Figure 2: Examples from six misleading categories defined in MVI-Bench. Each pair contains a normal image (left) and misleading image (right) with the same MCQ and corresponding ground-truth answer. For the misleading image, an additional distractor option is shown alongside the correct answer. Answer choices are omitted for brevity (see Fig. 4 for full format).
  • Figure 3: Overview of MVI-Bench statistics. (a) Six balanced misleading visual categories. (b) Three diverse image sources: natural, synthetic, and edited. (c) Broad object coverage across multiple domains. (d) High pairwise similarity ensures semantic consistency between normal and misleading image pairs.
  • Figure 4: Comparison between the "non-think" and "think" modes of SAIL-VL. In the non-think mode, the model answers directly based on visual evidence, while in the think mode, the model is guided by historical thoughts and tends to overemphasize fine details.
  • Figure 5: Attention-guided masking for a counterintuitive instance. Qwen2.5-VL-7B spuriously associates a receipt with a book. (a) On the normal image with one book, it answers incorrectly. (b) On the misleading image, it coincidentally answers “2” by counting the receipt as an extra book. (c) Masking the receipt flips the prediction, confirming the spurious correlation.
  • ...and 3 more figures