TASO: Jailbreak LLMs via Alternative Template and Suffix Optimization

Yanting Wang, Runpeng Geng, Jinghui Chen, Minhao Cheng, Jinyuan Jia

TL;DR

TASO jailbreaks safety-aligned LLMs by optimizing a jailbreak template and a suffix in an alternating loop. Suffix optimization steers the first few output tokens, while template optimization guides the entire response; combining the two mitigates overfitting and raises the overall success rate. Evaluated on HarmBench and AdvBench across 24 LLMs, TASO achieves high attack success rates and remains robust to defenses, and ablations show the individual and combined benefits of both components. The work highlights the practical significance of integrating template- and suffix-based strategies, and points to future work on efficiency and broader attack vectors, with implications for defense design.

Abstract

Many recent studies have shown that LLMs are vulnerable to jailbreak attacks, where an attacker can perturb the input of an LLM to induce it to generate an output for a harmful question. In general, existing jailbreak techniques either optimize a semantic template intended to induce the LLM to produce harmful outputs or optimize a suffix that leads the LLM to begin its response with specific tokens (e.g., "Sure"). In this work, we introduce TASO (Template and Suffix Optimization), a novel jailbreak method that optimizes both a template and a suffix in an alternating manner. Our insight is that suffix optimization and template optimization are complementary: suffix optimization can effectively control the first few output tokens but cannot control the overall quality of the output, while template optimization provides guidance for the entire output but cannot effectively control the initial tokens, which significantly impact subsequent responses. Thus, they can be combined to improve the attack's effectiveness. We evaluate TASO on benchmark datasets (including HarmBench and AdvBench) across 24 leading LLMs (including models from the Llama family, OpenAI, and DeepSeek). The results demonstrate that TASO can effectively jailbreak existing LLMs. We hope our work can inspire future studies to explore this direction.

Paper Structure

This paper contains 30 sections, 3 equations, 12 figures, 8 tables, 1 algorithm.

Figures (12)

  • Figure 1: Example of a jailbreak attack. The LLM initially refuses to respond when directly prompted with the harmful query. However, it provides a harmful answer when the user uses an optimized jailbreak prompt instead.
  • Figure 2: Example jailbreak prompt generated by template optimization and suffix optimization.
  • Figure 3: An overview of our method. At each iteration, we initialize a jailbreak prompt by appending an initialized suffix to the current template. We first optimize the suffix, then reattach this optimized suffix to the template to generate random responses from the target LLM. Finally, we refine the template based on those responses, and continue to the next iteration.
  • Figure 4: Example failure behaviors even when the output starts with the target token.
  • Figure 5: An initial template adapted from previous work (Andriushchenko et al., 2024).
  • ...and 7 more figures