Adversarial Robustness for Visual Grounding of Multimodal Large Language Models

Kuofeng Gao; Yang Bai; Jiawang Bai; Yong Yang; Shu-Tao Xia

TL;DR

The paper studies how vulnerable the visual grounding of multimodal LLMs is to adversarial perturbations, using referring expression comprehension (REC) as the test task. It introduces three attack paradigms (untargeted, exclusive targeted, and permuted targeted) and evaluates them on MiniGPT-v2 (7B) across RefCOCO, RefCOCO+, and RefCOCOg, using PGD with $\epsilon=16$ and IoU-based success criteria. Untargeted attacks are effective: IoU@0.5 drops to $23.24\%$ and $33.90\%$ for the two methods. Targeted attacks raise the reported success measure sharply, from $0.13\%$ to $61.84\%$ for the exclusive attack and to $30.14\%$ for the permuted attack, which indicates that the attack types differ in difficulty. The results provide a baseline for studying adversarial robustness in visual grounding and point to the need for defenses before MLLMs are deployed in vision-language tasks.
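As an illustration of the IoU-based criterion mentioned above, the following is a minimal sketch, not the authors' evaluation code. The corner-coordinate box format, the helper names `iou` and `is_hit`, and the strict `>` comparison against the 0.5 threshold are all assumptions made for this example.

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_hit(pred: Box, ref: Box, threshold: float = 0.5) -> bool:
    """True when the predicted box overlaps the reference box enough.

    Assumption: a strict '>' comparison against the threshold.
    """
    return iou(pred, ref) > threshold
```

Under this reading, an untargeted attack would be scored with `ref` set to the ground-truth box, so a lower hit rate means a stronger attack. A targeted attack would be scored with `ref` set to the attacker's chosen box, so a higher hit rate means a stronger attack.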

Abstract

Multimodal Large Language Models (MLLMs) have recently achieved strong performance across various vision-language tasks, including visual grounding. However, the adversarial robustness of visual grounding in MLLMs remains unexplored. To fill this gap, we use referring expression comprehension (REC) as an example visual grounding task and propose three adversarial attack paradigms. First, untargeted adversarial attacks induce MLLMs to generate an incorrect bounding box for each object. Second, exclusive targeted adversarial attacks force all generated outputs to the same target bounding box. Third, permuted targeted adversarial attacks permute the bounding boxes among the different objects within a single image. Extensive experiments demonstrate that the proposed methods can successfully attack the visual grounding capabilities of MLLMs. Our methods not only provide a new perspective for designing novel attacks but also serve as a strong baseline for improving the adversarial robustness of visual grounding in MLLMs.
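The two targeted paradigms differ only in which box each object is pushed toward. The sketch below shows one way the target boxes could be assigned. The function names are hypothetical, and using a cyclic shift as the permutation is an assumption for this example, since the abstract does not say how the permutation is chosen.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def exclusive_targets(boxes: List[Box], target: Box) -> List[Box]:
    """Exclusive targeted setting: every object gets the same target box."""
    return [target for _ in boxes]


def permuted_targets(boxes: List[Box], shift: int = 1) -> List[Box]:
    """Permuted targeted setting: each object is assigned another object's box.

    Assumption: a cyclic shift is used as the permutation, so no object
    keeps its own box when there are at least two objects.
    """
    n = len(boxes)
    if n < 2:
        raise ValueError("need at least two objects to permute boxes")
    k = shift % n
    if k == 0:
        raise ValueError("shift must not be a multiple of the object count")
    return [boxes[(i + k) % n] for i in range(n)]
```

For three objects with boxes `[a, b, c]`, `permuted_targets` yields `[b, c, a]`, while `exclusive_targets` with target `t` yields `[t, t, t]`. The untargeted setting needs no target list, because any sufficiently wrong box counts as a success.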

Paper Structure (12 sections, 4 equations, 1 figure, 2 tables)

Introduction
The Proposed Attack
  Preliminaries
  Untargeted Adversarial Attacks
  Targeted Adversarial Attacks
Experiments
  Experimental Setups
  Main Results
Related Work
  Multimodal Large Language Models
  Adversarial Attacks
Conclusion

Figures (1)

Figure 1: Three adversarial attack paradigms are proposed to evaluate the adversarial robustness for visual grounding of MLLMs.
