B-AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Black-box Adversarial Visual-Instructions

Hao Zhang, Wenqi Shao, Hong Liu, Yongqiang Ma, Ping Luo, Yu Qiao, Nanning Zheng, Kaipeng Zhang

TL;DR

B-AVIBench addresses the vulnerability of large vision-language systems to black-box adversarial visual-instructions by constructing a large, LVLM-agnostic benchmark (316K B-AVIs) spanning image, text, and bias perturbations. The framework evaluates 14 open-source LVLMs and two closed-source systems, uncovering substantial weaknesses and persistent biases even in advanced models. It introduces novel attack types (image corruptions, decision-based image attacks, text attacks, and content-bias AVIs) and demonstrates varying degrees of robustness across architectures and adapters, highlighting the need for targeted defenses and fair evaluation. The work provides an open-source toolkit and dataset to advance robustness, safety, and fairness in multimodal foundation models.

Abstract

Large Vision-Language Models (LVLMs) have shown significant progress in responding well to visual-instructions from users. However, these instructions, encompassing images and text, are susceptible to both intentional and inadvertent attacks. Despite the critical importance of LVLMs' robustness against such threats, current research in this area remains limited. To bridge this gap, we introduce B-AVIBench, a framework designed to analyze the robustness of LVLMs when facing various Black-box Adversarial Visual-Instructions (B-AVIs), including four types of image-based B-AVIs, ten types of text-based B-AVIs, and nine types of content bias B-AVIs (such as gender, violence, cultural, and racial biases, among others). We generate 316K B-AVIs encompassing five categories of multimodal capabilities (ten tasks) and content bias. We then conduct a comprehensive evaluation involving 14 open-source LVLMs to assess their performance. B-AVIBench also serves as a convenient tool for practitioners to evaluate the robustness of LVLMs against B-AVIs. Our findings and extensive experimental results shed light on the vulnerabilities of LVLMs, and highlight that inherent biases exist even in advanced closed-source LVLMs like GeminiProVision and GPT-4V. This underscores the importance of enhancing the robustness, security, and fairness of LVLMs. The source code and benchmark are available at https://github.com/zhanghao5201/B-AVIBench.
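As a rough illustration of the black-box setting described above, the sketch below corrupts an image with Gaussian noise at a chosen severity level and compares a model's accuracy on clean and corrupted inputs, using only input/output queries. It is a minimal sketch under stated assumptions and is not taken from the B-AVIBench toolkit: the names `gaussian_noise`, `accuracy`, and `toy_model`, the `SIGMAS` values, and the toy data are all hypothetical, and the accuracy comparison stands in for, rather than reproduces, the paper's robustness score.

```python
import random
from typing import Callable, List, Sequence, Tuple

Image = List[float]  # toy grayscale image: flat list of pixel intensities in [0, 1]
Sample = Tuple[Image, str, str]  # (image, question, expected answer)

# Hypothetical noise levels for severities 1..5 (illustrative, not from the paper).
SIGMAS = (0.04, 0.08, 0.12, 0.18, 0.26)


def gaussian_noise(image: Image, severity: int, seed: int = 0) -> Image:
    """Add zero-mean Gaussian noise and clip the result back to [0, 1]."""
    if not 1 <= severity <= len(SIGMAS):
        raise ValueError("severity must be between 1 and 5")
    rng = random.Random(seed)
    sigma = SIGMAS[severity - 1]
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in image]


def accuracy(model: Callable[[Image, str], str],
             samples: Sequence[Sample],
             perturb: Callable[[Image], Image] = lambda img: img) -> float:
    """Query the model as a black box (inputs in, text out) and score exact matches."""
    correct = sum(model(perturb(img), q) == ans for img, q, ans in samples)
    return correct / len(samples)


def toy_model(image: Image, question: str) -> str:
    """Stand-in for an LVLM: answers 'bright' if the mean intensity exceeds 0.5."""
    return "bright" if sum(image) / len(image) > 0.5 else "dark"


samples: List[Sample] = [
    ([0.9] * 64, "Is the image bright or dark?", "bright"),
    ([0.1] * 64, "Is the image bright or dark?", "dark"),
]

clean_acc = accuracy(toy_model, samples)
noisy_acc = accuracy(toy_model, samples, lambda img: gaussian_noise(img, severity=3))
print(f"clean={clean_acc:.2f} corrupted={noisy_acc:.2f}")
```

The `perturb` argument keeps the evaluation loop independent of the attack, so a text perturbation or a different image corruption could be swapped in without touching the model, which is treated purely as a function from inputs to an answer string.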

Paper Structure (25 sections, 2 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: The overview of B-AVIBench.
  • Figure 2: Comparison of LVLMs' robustness scores under black-box adversarial visual-instructions. Each subfigure lists the five most robust LVLMs under the corresponding attack; the number inside each triangle indicates the rank. The robustness score is defined in Section 4-A.
  • Figure 3: Results of B-AVIs on LLaVA-1.5. (a) Image corruption example. (b) Decision-based optimized black-box image attack example. (c) Black-box text attack example. (d) Content bias attack example.
  • Figure 4: More examples of B-AVIs. (a) Image corruption B-AVIs, covering the 19 corruption types from [hendrycks2018benchmarking] at the third corruption level. (b) Decision-based optimized black-box image attack B-AVIs for LLaVA-1.5. (c) Black-box text attack B-AVIs for LLaVA-1.5. Due to ethical considerations, we do not display additional content bias B-AVIs.
  • Figure 5: Further analysis: relationship between the robustness score to B-AVIs and (a) tuning parameters, (b) Vicuna adapters, (c) LLaMA adapters, (d) training data volume, (e) LLMs. FC, Q-f., Full_Tu., B_Tu., Resa., and RedPajama denote the fully connected layer [llava], Q-Former [li2023blip], Full Tuning [chen2023sharegpt4v], Bias Tuning [gao2023llama], Resampler [openfamingov2], and RedPajama-INCITE-Instruct [openfamingov2], respectively.
  • ...and 1 more figure