ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings

Hao Wang, Hao Li, Minlie Huang, Lei Sha

TL;DR

The paper addresses the evolving risk of jailbreak prompts in LLMs by introducing ASETF, a two-stage framework that first optimizes continuous adversarial suffix embeddings and then translates them into fluent text via an embedding translation model. This approach dramatically reduces computational cost relative to discrete-token methods and yields higher attack success rates with improved prompt fluency, while enabling transferable attacks across multiple LLMs, including black-box systems. Key contributions include the continuous embedding optimization with MMD loss, a self-supervised embedding translation framework, and demonstrated improvements in efficiency, fluency, and cross-model transferability on models such as Llama2 and Vicuna using Advbench data. The work offers a practical, plug-and-play pathway for adversarial suffix generation and informs defense design by highlighting vulnerabilities that persist under paraphrasing and cross-model transfer, with implications for strengthening LLM safety defenses.

Abstract

Safety defenses for large language models (LLMs) remain limited because dangerous prompts are manually curated for just a few known attack types, which fails to keep pace with emerging varieties. Recent studies found that attaching suffixes to harmful instructions can break through the defenses of LLMs and lead to dangerous outputs. However, as with traditional text adversarial attacks, this approach, while effective, is limited by the challenge of optimizing over discrete tokens. The gradient-based discrete optimization attack requires over 100,000 LLM calls, and because the adversarial suffixes are unreadable, it can be countered relatively easily by common defense methods such as perplexity filters. To cope with this challenge, we propose an Adversarial Suffix Embedding Translation Framework (ASETF), which transforms continuous adversarial suffix embeddings into coherent and understandable text. This method greatly reduces the computational overhead of the attack process and helps to automatically generate multiple adversarial samples, which can be used as data to strengthen LLM security defenses. Experimental evaluations were conducted on Llama2, Vicuna, and other prominent LLMs, using harmful directives sourced from the Advbench dataset. The results indicate that our method significantly reduces the computation time for adversarial suffixes and achieves a much better attack success rate than existing techniques, while significantly enhancing the textual fluency of the prompts. In addition, our approach can be generalized into a broader method for generating transferable adversarial suffixes that can successfully attack multiple LLMs, even black-box LLMs such as ChatGPT and Gemini.

Paper Structure (26 sections, 15 equations, 8 figures, 7 tables)

This paper contains 26 sections, 15 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: A conceptual sketch of our method. We first obtain adversarial suffix embeddings through gradient-based optimization, and then use an embedding translation model to convert the obtained suffixes into fluent text with almost no change in embedding.
  • Figure 2: Illustration of the Embedding Translation Framework. (a) Single target: The context is mapped into embedding space by the translation LLM's embedding lookup layer, while the suffix is mapped into embedding space by the target LLM's lookup layer for adaptation. The goal is to successfully translate the adapted suffix back into the original text. (b) Multiple targets: The embedding lookup layers of multiple target LLMs are integrated so that the translated suffix can universally attack all targets, even black-box target LLMs.
  • Figure 3: A case of attacking LLMs that only provide APIs or web services.
  • Figure 4: Comparison of embeddings before and after translation. Points from the same set of data share the same shape, with red indicating before translation and blue indicating after translation.
  • Figure 5: A visual explanation of the MMD loss, where blue dots represent the optimized vectors and red x markers represent the word embedding vectors of the model to be attacked.
  • ...and 3 more figures
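
The Figure 5 caption describes an MMD (maximum mean discrepancy) loss between the optimized vectors and the word embedding vectors of the model to be attacked. The exact formulation used in the paper is not given in this section, so the following is only a minimal sketch of a generic squared MMD. The Gaussian kernel, the biased estimator, the bandwidth value, and all names (rbf_kernel, mmd_squared, vocab_embeddings, near, far) are assumptions made for illustration, and the data is random toy data rather than real model embeddings.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a (n x d) and b (m x d)."""
    sq_dist = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    sq_dist = np.maximum(sq_dist, 0.0)  # clip tiny negatives from rounding
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(x, y, bandwidth=4.0):
    """Biased estimate of squared MMD between sample sets x and y.

    Assumption: Gaussian kernel with a hand-picked bandwidth; the paper
    may use a different kernel or estimator.
    """
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy


# Toy stand-ins (hypothetical): a "vocabulary" of 200 embeddings in 16 dims,
# a suffix lying close to real token embeddings, and one far from them.
rng = np.random.default_rng(0)
vocab_embeddings = rng.normal(size=(200, 16))
near = vocab_embeddings[:20] + 0.01 * rng.normal(size=(20, 16))
far = rng.normal(loc=3.0, size=(20, 16))

print("MMD^2 (near vs vocab):", mmd_squared(near, vocab_embeddings))
print("MMD^2 (far vs vocab): ", mmd_squared(far, vocab_embeddings))
```

In this sketch, vectors that stay close to the vocabulary embeddings produce a smaller value than vectors that drift away, which is the kind of signal a loss term could use to keep optimized suffix embeddings near the distribution of real token embeddings.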