Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space

Leo Schwinn, David Dobre, Sophie Xhonneux, Gauthier Gidel, Stephan Günnemann

TL;DR

This paper introduces embedding-space attacks for open-source LLMs, showing that perturbing continuous input embeddings can bypass safety alignment more efficiently than discrete attacks or fine-tuning. It formalizes the threat model, implements universal, individual, and multi-layer variants, and demonstrates high attack success rates across multiple models and datasets, including bypassing circuit breakers and retrieving supposedly forgotten information and even training data. The work also frames embedding-space attacks as an interrogation tool for unlearned models, revealing residual knowledge after unlearning and exposing the risk of data extraction from pretrained models. Together, the results establish embedding-space attacks as an important and scalable threat model for open-source LLM safety and motivate robust defenses and responsible deployment practices.

Abstract

Current research in adversarial robustness of LLMs focuses on discrete input manipulations in the natural language space, which can be directly transferred to closed-source models. However, this approach neglects the steady progression of open-source models. As open-source models advance in capability, ensuring their safety also becomes increasingly imperative. Yet, attacks tailored to open-source LLMs that exploit full model access remain largely unexplored. We address this research gap and propose the embedding space attack, which directly attacks the continuous embedding representation of input tokens. We find that embedding space attacks circumvent model alignments and trigger harmful behaviors more efficiently than discrete attacks or model fine-tuning. Furthermore, we present a novel threat model in the context of unlearning and show that embedding space attacks can extract supposedly deleted information from unlearned LLMs across multiple datasets and models. Our findings highlight embedding space attacks as an important threat model in open-source LLMs. Trigger Warning: the appendix contains LLM-generated text with violence and harassment.
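The core idea in the abstract, optimizing a continuous input embedding rather than searching over discrete tokens, can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes a frozen linear-softmax "model" in place of an LLM, plain gradient descent, and hypothetical names (`optimize_embedding`, `W`, `e0`). It only shows that, with gradient access, a free-floating embedding can be pushed until a chosen output becomes more likely.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [x / total for x in exps]

def logits_for(W, e):
    # Toy frozen "model": one linear layer mapping an embedding to logits.
    return [sum(w * x for w, x in zip(row, e)) for row in W]

def optimize_embedding(W, e_init, target, steps=500, lr=0.02):
    """Gradient descent on the embedding e (the weights W stay frozen)
    to minimize the cross-entropy -log softmax(W e)[target]."""
    e = list(e_init)
    for _ in range(steps):
        p = softmax(logits_for(W, e))
        # Gradient of the loss w.r.t. the logits: p - one_hot(target).
        g = [pi - (1.0 if i == target else 0.0) for i, pi in enumerate(p)]
        # Chain rule back to the embedding: W^T g.
        grad_e = [sum(W[i][j] * g[i] for i in range(len(W)))
                  for j in range(len(e))]
        e = [x - lr * gx for x, gx in zip(e, grad_e)]
    return e

random.seed(0)
W = [[random.gauss(0, 1) for _ in range(8)] for _ in range(5)]  # 5 outputs, 8-dim input
e0 = [random.gauss(0, 1) for _ in range(8)]                     # initial embedding
e_adv = optimize_embedding(W, e0, target=3)

p_before = softmax(logits_for(W, e0))[3]
p_after = softmax(logits_for(W, e_adv))[3]
print(f"target probability: {p_before:.3f} -> {p_after:.3f}")
```

Unlike a discrete attack, the optimized vector is not constrained to correspond to any real token, which is why this kind of attack requires full model access.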

Paper Structure

This paper contains 28 sections, 4 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Illustration of discrete and embedding space attacks (this work). Discrete attacks manipulate discrete one-hot tokens $T_{adv} \in \mathcal{T}$, whereas embedding space attacks directly manipulate the continuous embedding representation of the input tokens.
  • Figure 2: We use a setting similar to that of Zou et al. (2023), except that attacks are optimized in the embedding space. Given an <instruction>, an adversarial embedding is optimized to trigger an affirmative <target> response, with the <goal> of triggering a subsequent generation related to the target. The <goal> is not provided during attack optimization.
  • Figure 3: Illustration of the multi-layer attack. In addition to the regularly generated sequence $T_k^L$, we decode alternative output sequences $T_k^l$ from intermediate layers of the neural network.
  • Figure 4: Attack success rate and average compute time of diverse discrete attacks and the proposed embedding attack for different models. Embedding attacks achieve higher success rates and are considerably more efficient compared to existing methods for all tested models.
  • Figure 5: The two rows show the perplexity and toxicity (obtained from toxic-bert) of responses generated by different LLMs with and without embedding space attacks on the harmful behavior dataset. Additionally, the scores of the fine-tuned Llama2 model are compared with those of the regular Llama2 under attack. Embedding attacks decrease perplexity for all models while significantly increasing toxicity for most models (differences that are significant under a Mann–Whitney U test are marked with *).
  • ...and 4 more figures
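
The Figure 5 caption marks significance with a Mann–Whitney U test. As a rough illustration of what that test computes, here is a minimal sketch in plain Python. The score lists are made-up placeholders, not values from the paper, and the p-value uses the normal approximation without tie or continuity correction, which is an assumption of this sketch rather than the authors' stated procedure.

```python
import math

def mann_whitney_u(xs, ys):
    """U statistic for sample xs: the number of (x, y) pairs with x > y,
    counting ties as one half."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

def two_sided_p_normal(u, n1, n2):
    """Two-sided p-value from the normal approximation of U
    (no tie correction, no continuity correction)."""
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical toxicity scores, for illustration only.
attacked = [0.81, 0.77, 0.92, 0.65, 0.88]
baseline = [0.05, 0.12, 0.03, 0.20, 0.09]

u = mann_whitney_u(attacked, baseline)
p = two_sided_p_normal(u, len(attacked), len(baseline))
print(f"U = {u}, p = {p:.4f}")
```

For samples this small an exact test is usually preferable; the approximation is used here only to keep the example dependency-free.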