Misusing Tools in Large Language Models With Visual Adversarial Examples

Xiaohan Fu, Zihan Wang, Shuheng Li, Rajesh K. Gupta, Niloofar Mireshghallah, Taylor Berg-Kirkpatrick, Earlence Fernandes

TL;DR

The paper reveals a security risk in multimodal LLMs that integrate tools by showing that carefully crafted visual adversarial examples can coerce the model into invoking attacker-chosen tools, potentially harming user resources. It introduces a gradient-based, white-box adversarial image optimization framework that balances stealth and attack success, and generalizes across unseen prompts using prompt–response data from GPT-4 and Alpaca. Empirical results on an open-source multimodal LLM (LLaMA Adapter) show high attack success (~98%) and image similarity (~0.9 SSIM) while preserving conversational quality, with measurable declines in user-perceived utility and transfer to out-of-domain prompts. The work highlights practical defense implications and calls for guarded tool authorization, while acknowledging limitations and proposing future work on black-box transferability and broader model evaluation.

Abstract

Large Language Models (LLMs) are being enhanced with the ability to use tools and to process multiple modalities. These new capabilities bring new benefits and also new security risks. In this work, we show that an attacker can use visual adversarial examples to cause attacker-desired tool usage. For example, the attacker could cause a victim LLM to delete calendar events, leak private conversations and book hotels. Different from prior work, our attacks can affect the confidentiality and integrity of user resources connected to the LLM while being stealthy and generalizable to multiple input prompts. We construct these attacks using gradient-based adversarial training and characterize performance along multiple dimensions. We find that our adversarial images can manipulate the LLM to invoke tools following real-world syntax almost always (~98%) while maintaining high similarity to clean images (~0.9 SSIM). Furthermore, using human scoring and automated metrics, we find that the attacks do not noticeably affect the conversation (and its semantics) between the user and the LLM.
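The abstract reports image similarity as an SSIM score (~0.9). As a rough illustration of what that number measures, here is a minimal sketch of a single-window ("global") SSIM in plain Python. It is an assumption-laden simplification: standard SSIM is usually averaged over local windows, and the page does not state which SSIM configuration the authors used. The function name `global_ssim`, the flat pixel lists, and the sample values are all hypothetical.

```python
def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM between two equal-length flat pixel sequences.

    Simplified illustration only: library implementations normally compute
    SSIM over local windows and average the results.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x)
    # Stabilizing constants from the usual SSIM definition (K1=0.01, K2=0.03).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov_xy + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Hypothetical toy "images" with pixel values in [0, 1].
clean = [0.2, 0.8, 0.5, 0.4, 0.9, 0.1]
perturbed = [0.25, 0.75, 0.5, 0.45, 0.85, 0.1]

score_same = global_ssim(clean, clean)           # identical inputs
score_perturbed = global_ssim(clean, perturbed)  # slightly perturbed copy
```

A score close to 1 means the perturbed image is structurally close to the clean one, which is the sense in which the reported ~0.9 indicates a stealthy perturbation.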

Paper Structure

This paper contains 21 sections, 3 equations, 5 figures, and 7 tables.

Figures (5)

  • Figure 1: An example of our attack. The benign-looking adversarial image manipulates the model into generating the malicious tool invocations we specified (in red), under different conversation contexts, in addition to a normal response. The tool invocation text is not printed in practice because it is processed directly as a function call (see ChatGPT).
  • Figure 2: Overall architecture of our attack method. We train the targeted image using gradient-based optimization, and separate the loss term into three components, aiming at keeping perturbations imperceptible, maintaining response utility, and achieving malicious behavior respectively.
  • Figure 3: Illustration of various attack cases. Note that the texts marked in red, as in Figure 1, are tool invocations that are not printed and are invisible to users.
  • Figure 4: The three images used for attack evaluations.
  • Figure 5: Three adversarial images with different SSIM index values generated from the same base image. Images (b), (f), and (g) are from delete_email in the main experiments table (tab:main_exps). Images (c) and (d) are from book_ticket in the hard attacks table (tab:hard_attacks).
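
The Figure 2 caption describes a loss split into three components (imperceptible perturbation, response utility, and the targeted tool invocation) that is minimized by gradient-based optimization over the image. The sketch below illustrates only that structure on a toy problem. Everything in it is an assumption for illustration: the "model" is a pair of small linear classifiers rather than a multimodal LLM, the perceptual term is a squared L2 distance, and the names `W_resp`, `W_tool`, `lam_p`, `lam_r`, `lam_t` and all numeric values are hypothetical. It is not the authors' implementation.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def ce_and_grad(W, x, target):
    """Cross-entropy of a toy linear model (logits = W x) against a target
    class, plus the gradient of that loss with respect to the input x."""
    logits = [sum(w * v for w, v in zip(row, x)) for row in W]
    p = softmax(logits)
    err = [pk - (1.0 if k == target else 0.0) for k, pk in enumerate(p)]
    grad = [sum(W[k][j] * err[k] for k in range(len(W))) for j in range(len(x))]
    return -math.log(p[target]), grad


def total_loss_and_grad(x, x0, W_resp, y_resp, W_tool, y_tool,
                        lam_p=1.0, lam_r=1.0, lam_t=1.0):
    # Component 1: keep the perturbation small (squared L2 distance here).
    l_p = sum((a - b) ** 2 for a, b in zip(x, x0))
    g_p = [2.0 * (a - b) for a, b in zip(x, x0)]
    # Component 2: keep the normal response likely (stand-in for utility).
    l_r, g_r = ce_and_grad(W_resp, x, y_resp)
    # Component 3: make the attacker-chosen output likely (stand-in for the tool call).
    l_t, g_t = ce_and_grad(W_tool, x, y_tool)
    loss = lam_p * l_p + lam_r * l_r + lam_t * l_t
    grad = [lam_p * a + lam_r * b + lam_t * c for a, b, c in zip(g_p, g_r, g_t)]
    return loss, grad


def optimize(x0, W_resp, y_resp, W_tool, y_tool, steps=200, lr=0.05, **lams):
    """Projected gradient descent on the image, clipping pixels to [0, 1]."""
    x = list(x0)
    history = []
    for _ in range(steps):
        loss, grad = total_loss_and_grad(x, x0, W_resp, y_resp, W_tool, y_tool, **lams)
        history.append(loss)
        x = [min(1.0, max(0.0, v - lr * g)) for v, g in zip(x, grad)]
    return x, history


# Hypothetical toy setup: a 4-"pixel" image and two tiny linear heads.
x0 = [0.2, 0.8, 0.5, 0.4]
W_resp = [[1.0, 0.0, 0.5, -0.5], [0.0, 1.0, -0.5, 0.5]]
W_tool = [[0.5, -1.0, 0.0, 1.0], [-0.5, 1.0, 1.0, -1.0], [0.0, 0.0, -1.0, 0.5]]

x_adv, history = optimize(x0, W_resp, 1, W_tool, 0)
```

The weights `lam_p`, `lam_r`, and `lam_t` expose the trade-off the caption points to: raising `lam_p` keeps the image closer to the original, while raising `lam_t` pushes harder toward the targeted output.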