On the robustness of multimodal language model towards distractions

Ming Liu, Hao Chen, Jindong Wang, Wensheng Zhang

TL;DR

Vision-language models struggle under noisy inputs common in real-world scenarios. The authors introduce I-ScienceQA, a distraction-augmented benchmark built on ScienceQA, to evaluate robustness across four distraction types using 14 VLMs. Textual distractions tend to degrade performance more than visual ones, larger and more capable architectures generally show greater resilience, and prompt-based defenses offer only partial mitigation; data contamination risks complicate interpretation. The work provides a valuable dataset and insights to guide robust multimodal reasoning and defense development in practical settings.

Abstract

Although vision-language models (VLMs) have achieved significant success in various applications such as visual question answering, their resilience to prompt variations remains an under-explored area. Understanding how distractions affect VLMs is crucial for improving their real-world applicability, as inputs in many practical scenarios contain noisy and irrelevant information. This paper aims to assess the robustness of VLMs against both visual and textual distractions in the context of science question answering. Built on the ScienceQA dataset, we developed a new benchmark that introduces distractions in both the visual and textual contexts to evaluate the reasoning capacity of VLMs amid these distractions. Our findings reveal that most state-of-the-art VLMs, including GPT-4, are vulnerable to various types of distractions and show noticeable degradation in reasoning capabilities when confronted with them. Notably, models such as InternVL2 demonstrate a higher degree of robustness to these distractions. We also found that models exhibit greater sensitivity to textual distractions than visual ones. Additionally, we explored various mitigation strategies, such as prompt engineering, to counteract the impact of distractions. While these strategies improved solution accuracy, our analysis shows that there remain significant opportunities for improvement.

Paper Structure

This paper contains 15 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Diagram illustrating the various distraction scenarios we apply to samples in the ScienceQA dataset.
  • Figure 2: Comparison of Exact Match Score for InternVL2 (top) and LLaVA models (bottom).
  • Figure 3: Training datasets, vision encoders, and language models for LLaVA, CogVLM2, InstructBLIP, and InternVL2. Non-QA datasets are connected with lighter lines. InternVL2 employs the most diverse QA datasets, enhancing its robustness. Connections to the ScienceQA dataset are highlighted. See Appendix for details.