Fast Adversarial Attacks on Language Models In One GPU Minute

Vinu Sankar Sadasivan; Shoumik Saha; Gaurang Sriramanan; Priyatham Kattakinda; Atoosa Chegini; Soheil Feizi

TL;DR

This paper presents BEAST, a gradient-free, beam search-based adversarial attack on language models (LMs) that runs within one minute on a single GPU. BEAST is applied to three attack settings: jailbreaking, hallucination elicitation, and privacy leakage. It jailbreaks multiple aligned models with high success rates, induces hallucinations through an untargeted attack, and improves existing membership inference attacks when its adversarial prompts are combined with them. The method exposes interpretable hyperparameters (the beam size and the number of tokens sampled per beam element) that trade off attack speed, readability of the adversarial prompt, and success rate, and it achieves competitive or superior results compared with gradient-based baselines under a constrained compute budget. The work highlights security and privacy implications for LM deployments, and the authors release open-source code to support further research in LM security.

Abstract

In this paper, we introduce a novel class of fast, beam search-based adversarial attacks (BEAST) for Language Models (LMs). BEAST employs interpretable parameters, enabling attackers to balance attack speed, success rate, and the readability of adversarial prompts. The computational efficiency of BEAST allows us to investigate its applications to LMs for jailbreaking, eliciting hallucinations, and privacy attacks. Our gradient-free targeted attack can jailbreak aligned LMs with high attack success rates within one minute. For instance, BEAST can jailbreak Vicuna-7B-v1.5 in under one minute with a success rate of 89%, whereas a gradient-based baseline takes over an hour to achieve a 70% success rate on a single Nvidia RTX A6000 48GB GPU. Additionally, we discover a unique outcome wherein our untargeted attack induces hallucinations in LM chatbots. Through human evaluations, we find that our untargeted attack causes Vicuna-7B-v1.5 to produce ~15% more incorrect outputs than it produces in the absence of our attack. We also find that 22% of the time, BEAST causes Vicuna to generate outputs that are not relevant to the original prompt. Further, we use BEAST to generate adversarial prompts in a few seconds that can boost the performance of existing membership inference attacks for LMs. We believe that our fast attack, BEAST, has the potential to accelerate research in LM security and privacy. Our codebase is publicly available at https://github.com/vinusankars/BEAST.

Paper Structure (27 sections, 3 equations, 6 figures, 8 tables, 1 algorithm)

Introduction
Related Works
Beam Search-based Adversarial Attack
Preliminaries
Our Threat Model
Our Method: BEAST
Jailbreaking Attacks
Setup
Baselines
Evaluation Methods
Results
Multiple Behaviour and Transferability
Hallucination Attacks
Setup
Results
...and 12 more sections

Figures (6)

Figure 1: An overview of our method, Beam Search-based Adversarial Attack (BEAST). Top panel: Depiction of how our method utilizes beam search for adversarially attacking LMs. At every attack iteration $(i+1)$, we maintain $k_1$ elements in our beam. The target LM multinomially samples $k_2$ tokens for each of the beam elements. These tokens are appended to the corresponding beam elements to generate a total of $k_1 \times k_2$ candidates. Each of the candidates is scored using an adversarial objective $\mathcal{L}$. The best $k_1$ candidates with the lowest adversarial scores are maintained in the beam and carried forward to the next attack iteration. Bottom panel: We demonstrate that our fast attacks can be used for various applications. (i) Left: In the Jailbreaking Attacks section, we find that BEAST can efficiently jailbreak a variety of LM chatbots by appending adversarial tokens based on a targeted attack objective $\mathcal{L}$. (ii) Center: In the Hallucination Attacks section, we show that we can successfully increase hallucinations in aligned LMs based on an untargeted adversarial objective. (iii) Right: The section on membership inference attacks demonstrates that BEAST can be used to improve the performance of existing tools used for membership inference attacks by generating adversarial prompts based on an untargeted attack objective.
Figure 2: Tradeoff between ASR and time for BEAST on Vicuna-7B, by varying our attack parameter $k$. We get 98% ASR in 2.65 minutes, while we get 66% ASR in just 10 seconds.
Figure 3: Hallucination attack evaluation using human and automated studies. One panel shows the relative attack advantage and inconsistency caused by BEAST, measured in an MTurk human study on Vicuna-7B-v1.5 and LLaMA-2-7B-Chat. The other panel shows the same quantities evaluated automatically using GPT-4-Turbo. BEAST elicits hallucination behavior in aligned LMs, as consistently indicated by both of the hallucination detection studies that we perform.
Figure 4: Screenshot showing the format and questions asked in human study for evaluating jailbreaking.
Figure 5: Screenshot showing the format and questions asked in human study for evaluating hallucination.
...and 1 more figures
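
The beam search loop described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `beam_attack`, `sample_tokens`, `toy_next_token_probs`, and `toy_objective` are hypothetical, the "LM" is a fixed toy distribution over five token ids, the objective is a toy stand-in for $\mathcal{L}$, and starting the beam from the bare prompt is an assumption. Only the loop structure (sample $k_2$ tokens per beam element, score the candidates, keep the $k_1$ lowest-scoring ones) follows the caption.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Tokens = Tuple[int, ...]


def sample_tokens(probs: Sequence[float], k: int, rng: random.Random) -> List[int]:
    """Multinomially sample up to k distinct token ids from a distribution."""
    remaining: Dict[int, float] = {i: p for i, p in enumerate(probs) if p > 0}
    picked: List[int] = []
    while remaining and len(picked) < k:
        ids = list(remaining)
        tok = rng.choices(ids, weights=[remaining[i] for i in ids])[0]
        picked.append(tok)
        del remaining[tok]  # sample without replacement
    return picked


def beam_attack(
    prompt: Tokens,
    next_token_probs: Callable[[Tokens], Sequence[float]],
    objective: Callable[[Tokens], float],
    k1: int,
    k2: int,
    num_tokens: int,
    seed: int = 0,
) -> Tokens:
    """Append num_tokens adversarial tokens to prompt using beam search.

    k1 is the beam size and k2 the number of tokens sampled per beam
    element; lower objective values are treated as better.
    """
    rng = random.Random(seed)
    beam: List[Tokens] = [prompt]  # assumption: the beam starts from the prompt
    for _ in range(num_tokens):
        candidates: List[Tokens] = []
        for element in beam:
            # The (toy) LM proposes k2 continuations for this beam element.
            for tok in sample_tokens(next_token_probs(element), k2, rng):
                candidates.append(element + (tok,))
        # Keep the k1 candidates with the lowest adversarial score.
        candidates.sort(key=objective)
        beam = candidates[:k1]
    return min(beam, key=objective)


def toy_next_token_probs(tokens: Tokens) -> Sequence[float]:
    """Hypothetical stand-in for an LM: a fixed distribution over 5 tokens."""
    return [0.10, 0.20, 0.30, 0.25, 0.15]


def toy_objective(tokens: Tokens) -> float:
    """Hypothetical stand-in for the adversarial objective: reward token 3."""
    return -float(sum(1 for t in tokens if t == 3))


if __name__ == "__main__":
    suffix_run = beam_attack((0, 1), toy_next_token_probs, toy_objective,
                             k1=2, k2=2, num_tokens=4, seed=0)
    print(suffix_run, toy_objective(suffix_run))
```

In this sketch, larger `k1` and `k2` widen the search at the cost of more objective evaluations per iteration (up to `k1 * k2`), which mirrors the speed versus success-rate tradeoff that the attack parameters are said to control.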
