Adversarial Attacks and Defenses in Large Language Models: Old and New Threats

Leo Schwinn; David Dobre; Stephan Günnemann; Gauthier Gidel

TL;DR

The paper investigates the fragile robustness of Large Language Models (LLMs) and the prevalence of flawed defense evaluations that risk overestimating protection. It argues for NLP-specific prerequisites, including clearly defined threat models and standardized benchmarks, and introduces embedding-space attacks as a practical threat model for open-source LLMs. Through analysis of a recent defense, it demonstrates how robustness claims can be circumvented under relaxed threat-model assumptions, and it highlights the efficiency of embedding-space attacks (e.g., rapid trigger formation on open-source models). The work calls for rigorous evaluation guidelines and threat-model design to curb the looming adversarial arms race and safeguard real-world deployments.

Abstract

Over the past decade, there has been extensive research aimed at enhancing the robustness of neural networks, yet this problem remains largely unsolved. One major impediment has been the overestimation of the robustness of new defense approaches due to faulty defense evaluations. Flawed robustness evaluations must be corrected in subsequent work, which dangerously slows down research and provides a false sense of security. In this context, we will face substantial challenges associated with an impending adversarial arms race in natural language processing, specifically with closed-source Large Language Models (LLMs) such as ChatGPT, Google Bard, or Anthropic's Claude. We provide a first set of prerequisites to improve the robustness assessment of new approaches and reduce the number of faulty evaluations. Additionally, we identify embedding-space attacks on LLMs as another viable threat model for generating malicious content in open-source models. Finally, we demonstrate on a recently proposed defense that, without LLM-specific best practices in place, it is easy to overestimate the robustness of a new approach.

Paper Structure (11 sections, 1 equation, 5 figures)

Introduction
Related work
The adversarial arms race in LLMs
A first set of prerequisites for accurate defense evaluations
Benchmarks
Threat model dimensions
Embedding attacks
Circumventing a defense
Conclusion
Ethics Statement
Embedding attack examples

Figures (5)

Figure 1: An example of the output the Llama2-7b chat model [touvron2023llama] produces when given the fixed user prompt in blue, with the token embeddings in red optimized by an embedding attack to produce the text in bold. Since we only optimize in embedding space, there is no corresponding string for the adversarial attack to map to. Prompt inspired by [carlini2023aligned]; the attack was run for 500 steps.
Figure 2: Output of the Llama2-7b chat model [touvron2023llama] when given a fixed user prompt in blue, with the token embeddings in red optimized by an embedding attack to produce the text in bold. Since we only optimize in embedding space, there is no corresponding string for the adversarial attack to map to. Prompt taken from AdvBench [zou2023universal]; the attack was run for 500 steps.
Figure 3: Output of the Llama2-7b chat model [touvron2023llama] when given a fixed user prompt in blue, with the token embeddings in red optimized by an embedding attack to produce the text in bold. Since we only optimize in embedding space, there is no corresponding string for the adversarial attack to map to. Prompt taken from AdvBench [zou2023universal]; the attack was run for 500 steps.
Figure 4: Output of the Llama2-7b chat model [touvron2023llama] when given the user prompt in red, which was optimized via an embedding-space attack to produce the text in bold. Here we freely optimize over all input tokens, unlike the previous cases, where we kept a fixed user prompt (shown in blue) and optimized only a subset of control tokens. As before, there is no corresponding string for the adversarial attack to map to. Prompt inspired by [carlini2023aligned]; the attack was run for 500 steps.
Figure 5: An example of the output of the Llama2-7b chat model [touvron2023llama] in which the malicious response derails after a few words. Prompt inspired by [carlini2023aligned]; the attack was run for 500 steps.
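
The captions above describe the embedding attack only in words: the user prompt stays fixed, a few input vectors are optimized directly in embedding space, and the result need not correspond to any token string. The following is a minimal sketch of that idea on a toy model, not the authors' implementation and not Llama2. Everything in it is an assumption made for illustration: the 4-token vocabulary, the identity matrices `E` and `W`, the mean-pooling "model", and the helper names `next_token_probs` and `embedding_attack`. Only the step count of 500 is taken from the captions.

```python
import math

VOCAB, DIM = 4, 4

# Hypothetical toy "model": next-token logits = W @ mean(input embeddings).
E = [[1.0 if i == j else 0.0 for j in range(DIM)] for i in range(VOCAB)]  # embedding table
W = [[1.0 if i == j else 0.0 for j in range(DIM)] for i in range(VOCAB)]  # output projection


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def next_token_probs(vectors):
    """Distribution over the next token given a list of input embedding vectors."""
    n = len(vectors)
    h = [sum(v[d] for v in vectors) / n for d in range(DIM)]
    logits = [sum(W[k][d] * h[d] for d in range(DIM)) for k in range(VOCAB)]
    return softmax(logits)


def embedding_attack(prompt_tokens, target_token, n_adv=2, steps=500, lr=1.0):
    """Gradient descent on free embedding vectors appended to a fixed prompt."""
    prompt = [E[t] for t in prompt_tokens]        # fixed user prompt (never updated)
    adv = [[0.0] * DIM for _ in range(n_adv)]     # adversarial embeddings (optimized)
    n = len(prompt) + n_adv
    for _ in range(steps):
        probs = next_token_probs(prompt + adv)
        # Gradient of the cross-entropy loss w.r.t. the logits: probs - one_hot(target).
        g_logits = [p - (1.0 if k == target_token else 0.0) for k, p in enumerate(probs)]
        # Chain rule through W and the mean; every input vector receives the same gradient.
        g_vec = [sum(W[k][d] * g_logits[k] for k in range(VOCAB)) / n for d in range(DIM)]
        for a in adv:
            for d in range(DIM):
                a[d] -= lr * g_vec[d]
    return adv


prompt_tokens = [0, 1, 0]
target = 3
clean = [E[t] for t in prompt_tokens] + [[0.0] * DIM for _ in range(2)]
before = next_token_probs(clean)[target]
adv = embedding_attack(prompt_tokens, target)
after = next_token_probs([E[t] for t in prompt_tokens] + adv)[target]
print(f"target probability before: {before:.3f}, after: {after:.3f}")
```

In this toy setting the optimized vectors in `adv` do not coincide with any row of `E`, which mirrors the captions' remark that the attack has no corresponding string. On a real open-source model the same loop would presumably run over the model's actual input embeddings and a multi-token target, with gradients obtained by automatic differentiation rather than by hand.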
