Technical Report for ICML 2024 TiFA Workshop MLLM Attack Challenge: Suffix Injection and Projected Gradient Descent Can Easily Fool An MLLM
Yangyang Guo, Ziwei Xu, Xilie Xu, YongKang Wong, Liqiang Nie, Mohan Kankanhalli
TL;DR
The paper studies how Multi-Modal Large Language Models (MLLMs) can be attacked so that their outputs are no longer Helpful, Honest, or Harmless (H1–H3), using LLaVA 1.5 as the target. It proposes a two‑stage attack. The first stage is suffix injection: an incorrect pseudo-label is added to the query, subject to a length and similarity constraint. The second stage perturbs the image with Projected Gradient Descent (PGD). Both stages are bounded by embedding similarity constraints, $sim(\mathbf{v}_{cle}, \mathbf{v}_{adv}) > \beta_v$ for the image and $sim(\mathbf{q}_{cle}, \mathbf{q}_{adv}) > \beta_q$ for the query, with $\beta_v=\beta_q=0.9$. GPT‑4o generates the pseudo-labels, which are then manually verified, and the query itself is perturbed by concatenating words from the incorrect option. For Harmlessness questions, harmful content from existing visual adversarial resources is appended to induce unsafe outputs. Results indicate that text-based attacks are more effective than image-only perturbations: suffix injection exploits the model's language bias, and adaptive strategies mitigate constraint violations. The authors acknowledge prompt misalignment as a key factor in the discrepancies observed against external evaluations. The work highlights practical vulnerabilities in contemporary MLLMs and motivates further work on defenses, robust prompt design, and more advanced PGD techniques.
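The query-side constraint $sim(\mathbf{q}_{cle}, \mathbf{q}_{adv}) > \beta_q$ can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `embed` is a toy bag-of-words stand-in for a real text encoder, cosine similarity is assumed as the `sim` function, and the word-by-word stopping rule in `inject_suffix` is only one plausible reading of the adaptive strategy, not the authors' implementation.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (hypothetical stand-in for a text encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def inject_suffix(query, wrong_option, beta_q=0.9):
    """Append words of the incorrect option while sim(q_cle, q_adv) > beta_q.

    Assumed rule: words are added one at a time and the loop stops at the
    first word that would violate the similarity constraint.
    """
    q_cle = embed(query)
    q_adv = query
    for word in wrong_option.split():
        candidate = q_adv + " " + word
        if cosine(q_cle, embed(candidate)) <= beta_q:
            break  # adding this word would violate the constraint
        q_adv = candidate
    return q_adv
```

With this toy embedding, a long query tolerates a short suffix, while a very short query may admit no suffix at all, which is why some adaptive handling of the constraint is needed.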
Abstract
This technical report introduces our top-ranked solution to the TiFA workshop MLLM attack challenge, which employs two approaches: suffix injection and projected gradient descent (PGD). Specifically, we first append the text of an incorrect option, identified by pseudo-labeling, to the original query as a suffix. Using this modified query, our second approach applies PGD to add imperceptible perturbations to the image. Combining these two techniques enables successful attacks on the LLaVA 1.5 model.
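For readers unfamiliar with PGD, the image-side step can be sketched as follows. This is a generic PGD loop under stated assumptions, not the authors' code: the perturbation is assumed to be bounded in the $L_\infty$ norm, the values of `eps`, `alpha`, and `steps` are illustrative defaults, and `grad_fn` is a hypothetical callback that returns the gradient of the attack objective with respect to the image. In the actual attack that objective would come from the MLLM, which is not reproduced here.

```python
def pgd_linf(x_clean, grad_fn, eps=8 / 255, alpha=2 / 255, steps=10):
    """Generic L-infinity PGD on a flat list of pixel values in [0, 1].

    grad_fn(x) is assumed to return the gradient of the objective to be
    maximised, as a list with the same length as x.
    """
    x_adv = list(x_clean)
    for _ in range(steps):
        grad = grad_fn(x_adv)
        for i, g in enumerate(grad):
            sign = (g > 0) - (g < 0)  # sign of the gradient: -1, 0, or 1
            value = x_adv[i] + alpha * sign  # signed gradient ascent step
            # Project back into the eps-ball around the clean pixel.
            value = min(max(value, x_clean[i] - eps), x_clean[i] + eps)
            # Keep the result a valid pixel value.
            x_adv[i] = min(max(value, 0.0), 1.0)
    return x_adv


# Toy objective: a linear score w . x, whose gradient is simply w.
w = [0.5, -1.0, 0.25]


def toy_loss(x):
    return sum(wi * xi for wi, xi in zip(w, x))


x_clean = [0.2, 0.5, 0.9]
x_adv = pgd_linf(x_clean, grad_fn=lambda x: w)
```

The projection step is what keeps the perturbation small: however many iterations run, no pixel moves more than `eps` away from its clean value.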
