
Technical Report for ICML 2024 TiFA Workshop MLLM Attack Challenge: Suffix Injection and Projected Gradient Descent Can Easily Fool An MLLM

Yangyang Guo, Ziwei Xu, Xilie Xu, YongKang Wong, Liqiang Nie, Mohan Kankanhalli

TL;DR

The paper addresses vulnerabilities of Multi-Modal Large Language Models (MLLMs) to attacks that compromise Helpfulness, Honesty, and Harmlessness (H1–H3) of outputs, focusing on LLaVA 1.5. It proposes a two‑stage attack combining suffix injection (adding an incorrect pseudo-label to the query under a length and similarity constraint) with a Projected Gradient Descent (PGD) based perturbation of the image, guided by embedding similarity constraints $sim(\mathbf{v}_{cle}, \mathbf{v}_{adv}) > \beta_v$ and $sim(\mathbf{q}_{cle}, \mathbf{q}_{adv}) > \beta_q$ with $\beta_v=\beta_q=0.9$. GPT‑4o is used to generate pseudo-labels, which are manually verified, and the approach includes perturbing the query itself by concatenating words from the incorrect option; for Harmless questions, harmful content from existing visual adversarial resources is appended to induce unsafe outputs. Results indicate text-based attacks are more effective than image-only perturbations, with suffix injection exploiting language bias and adaptive strategies mitigating constraint violations; the authors acknowledge prompt misalignment as a key factor in observed discrepancies with external evaluations. The work highlights practical vulnerabilities in contemporary MLLMs and motivates defense strategies, prompting further exploration of robust prompt design and advanced PGD techniques.
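The suffix-injection step with its query-similarity constraint can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the helper names `bow_cosine` and `inject_suffix` and the `max_words` parameter are hypothetical, and a bag-of-words cosine stands in for the embedding similarity $sim(\mathbf{q}_{cle}, \mathbf{q}_{adv})$, whose actual encoder is not specified here. Only the threshold $\beta_q = 0.9$ comes from the text above.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Toy stand-in for an embedding similarity: cosine over word counts (assumption)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def inject_suffix(query, wrong_option, max_words, beta_q=0.9, sim=bow_cosine):
    """Append words of the incorrect option to the query, one at a time.

    Stops early when the length cap (hypothetical `max_words`) is reached,
    the option text runs out, or the next word would push
    sim(clean query, adversarial query) to beta_q or below.
    """
    adv = query
    for word in wrong_option.split()[:max_words]:
        candidate = adv + " " + word
        if sim(query, candidate) <= beta_q:
            break  # constraint sim(q_cle, q_adv) > beta_q would be violated
        adv = candidate
    return adv


clean = "What color is the car parked next to the large red brick building on the corner"
adversarial = inject_suffix(clean, "blue", max_words=10)
```

Adding words one at a time and stopping at the first violation is one simple way to let the suffix shrink adaptively, which matches the caption's remark that the suffix may be cut short by the text constraint or by the end of the selected response.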

Abstract

This technical report introduces our top-ranked solution to the TiFA workshop MLLM attack challenge. It employs two approaches, i.e., suffix injection and projected gradient descent (PGD). Specifically, we first append the text of an incorrect option (pseudo-labeled) to the original query as a suffix. Using this modified query, our second approach applies the PGD method to add imperceptible perturbations to the image. Combining these two techniques enables successful attacks on the LLaVA 1.5 model.
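To make the PGD step concrete, here is a minimal NumPy sketch of the generic projected-gradient loop on a toy model. Everything model-specific is an assumption for illustration: a logistic loss over a linear scorer replaces LLaVA 1.5, the projection is onto an L-infinity ball of radius `eps` rather than the embedding-similarity constraint used in the report, and `eps`, `step`, and `n_steps` are hypothetical values.

```python
import numpy as np


def pgd_linf(x_clean, grad_fn, eps=8 / 255, step=2 / 255, n_steps=10):
    """Generic PGD: signed gradient ascent on the loss, then projection.

    grad_fn(x) must return the gradient of the attack loss w.r.t. x.
    """
    x_adv = x_clean.copy()
    for _ in range(n_steps):
        g = grad_fn(x_adv)
        x_adv = x_adv + step * np.sign(g)  # ascend the loss
        x_adv = np.clip(x_adv, x_clean - eps, x_clean + eps)  # project onto L-inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)  # keep a valid pixel range
    return x_adv


# Toy differentiable "model" (assumption): logistic loss of a linear scorer.
rng = np.random.default_rng(0)
w = rng.normal(size=16)
x = rng.uniform(0.2, 0.8, size=16)  # stand-in for a flattened image
y = 1.0  # true label in {-1, +1}


def loss(x_in):
    return np.log1p(np.exp(-y * (w @ x_in)))


def grad(x_in):
    # d/dx log(1 + exp(-y w.x)) = -y w / (1 + exp(y w.x))
    return -y * w / (1.0 + np.exp(y * (w @ x_in)))


x_adv = pgd_linf(x, grad)
```

In the setting described above, the loss would instead be computed with the suffix-injected query and would favor the pseudo-labeled incorrect response, with gradients taken with respect to the image.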

Paper Structure

This paper contains 7 sections, 2 equations, 1 figure, 2 tables, 1 algorithm.

Figures (1)

  • Figure 1: Attack performance w.r.t. varying suffix lengths. A suffix length of 0 indicates that no suffix injection is used to attack the LLaVA 1.5 model. Note that the suffix may be truncated before reaching the nominal length, either because of the text constraints or because the selected undesirable response ends.