Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift

Jielin Qiu, Yi Zhu, Xingjian Shi, Florian Wenzel, Zhiqiang Tang, Ding Zhao, Bo Li, Mu Li

TL;DR

This work introduces a comprehensive robustness benchmark for multimodal image-text models under distribution shifts by applying 17 image perturbations and 16 text perturbations across five tasks (image-text retrieval, visual reasoning, visual entailment, image captioning, and text-to-image generation). It evaluates 12 open-source models and proposes two new metrics, MultiModal Impact score (MMI) and Missing Object Rate (MOR), to quantify robustness and generation fidelity. Key findings show that image perturbations, especially zoom blur, are more damaging than text perturbations, with character-level text perturbations being particularly disruptive; BLIP-based models often exhibit stronger robustness, potentially due to generative losses. The study further uses Optimal Transport alignments and Grad-CAM visualizations to interpret failure modes and discusses implications for unimodal robustness and future directions, including data augmentation and fairness considerations for robust multimodal systems.

Abstract

Multimodal image-text models have shown remarkable performance in the past few years. However, evaluating robustness against distribution shifts is crucial before adopting them in real-world applications. In this work, we investigate the robustness of 12 popular open-sourced image-text models under common perturbations on five tasks (image-text retrieval, visual reasoning, visual entailment, image captioning, and text-to-image generation). In particular, we propose several new multimodal robustness benchmarks by applying 17 image perturbation and 16 text perturbation techniques on top of existing datasets. We observe that multimodal models are not robust to image and text perturbations, especially to image perturbations. Among the tested perturbation methods, character-level perturbations constitute the most severe distribution shift for text, and zoom blur is the most severe shift for image data. We also introduce two new robustness metrics (MMI for MultiModal Impact score and MOR for Missing Object Rate) for proper evaluations of multimodal models. We hope our extensive study sheds light on new directions for the development of robust multimodal models. More details can be found on the project webpage: https://MMRobustness.github.io.
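
To make the idea of a character-level text perturbation concrete, the following is a minimal sketch of a "character replace" style corruption. It is an illustration only, not the benchmark's implementation: the function name `char_replace`, the `rate` and `seed` parameters, and the example caption are all assumptions introduced here.

```python
import random
import string


def char_replace(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Replace a fraction of the letters in `text` with random lowercase letters.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    rng = random.Random(seed)  # local generator keeps the result reproducible
    chars = list(text)
    # Only letters are candidates, so spaces and punctuation stay intact.
    letter_positions = [i for i, c in enumerate(chars) if c.isalpha()]
    n_swaps = round(rate * len(letter_positions))
    for i in rng.sample(letter_positions, n_swaps):
        # Exclude the current letter so every swap is a real change.
        choices = [c for c in string.ascii_lowercase if c != chars[i].lower()]
        chars[i] = rng.choice(choices)
    return "".join(chars)


# Made-up caption, used only to show the effect of the perturbation.
caption = "a tabby kitten sitting next to a tree"
perturbed = char_replace(caption, rate=0.2)
print(perturbed)
```

Feeding both `caption` and `perturbed` to the same model and comparing the task scores is one way to observe the kind of sensitivity the abstract describes; raising `rate` gives a more severe shift.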

Paper Structure (59 sections, 33 figures, 38 tables)

Figures (33)

  • Figure 1: Multimodal models are sensitive to image/text perturbations (original image-text pairs are shown in blue boxes, perturbed ones in red). Image captioning (top): Adding image perturbations can result in incorrect captions, e.g., the tabby kitten is mistakenly described as a woman/dog. Text-to-image generation (bottom): Applying text perturbations can result in generated images that contain incomplete visual information, e.g., the tree is missing in the two examples above.
  • Figure 2: Examples of our 17 image perturbations. The original image is taken from the COCO dataset and shown on the top left.
  • Figure 3: Optimal Transport (OT) alignment visualization between text and perturbed images, where pixelate and zoom blur are two highly effective image perturbation methods, and brightness and glass blur are two less effective ones.
  • Figure 4: Optimal Transport (OT) alignment visualization between perturbed text and images, where keyboard and character replace are two highly effective text perturbation methods, and insert punctuation and formal are two mild ones.
  • Figure 5: (a) Image captioning results of BLIP; (b) Image captioning results of GRIT; (c) Grad-CAM visualizations on the cross-attention maps corresponding to individual words under image perturbations, where the zoom blur and pixelate perturbed images show worse word-image attention alignment than the brightness perturbed image. For example, in zoom blur and pixelate, the attention maps for the words "door" and "glasses" do not match the correct image patches, while in brightness, all words' attention maps match correctly.
  • ...and 28 more figures