TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models

Zhi Xu, Jiaqi Li, Xiaotong Zhang, Hong Yu, Han Liu

TL;DR

TAO-Attack is a new optimization-based jailbreak method for large language models. It consistently outperforms state-of-the-art methods, achieving higher attack success rates and reaching 100% in certain scenarios.

Abstract

Large language models (LLMs) have achieved remarkable success across diverse applications but remain vulnerable to jailbreak attacks, where attackers craft prompts that bypass safety alignment and elicit unsafe responses. Among existing approaches, optimization-based attacks have shown strong effectiveness, yet current methods often suffer from frequent refusals, pseudo-harmful outputs, and inefficient token-level updates. In this work, we propose TAO-Attack, a new optimization-based jailbreak method. TAO-Attack employs a two-stage loss function: the first stage suppresses refusals to ensure the model continues harmful prefixes, while the second stage penalizes pseudo-harmful outputs and encourages the model toward more harmful completions. In addition, we design a direction-priority token optimization (DPTO) strategy that improves efficiency by aligning candidates with the gradient direction before considering update magnitude. Extensive experiments on multiple LLMs demonstrate that TAO-Attack consistently outperforms state-of-the-art methods, achieving higher attack success rates and even reaching 100% in certain scenarios.

Paper Structure

This paper contains 50 sections, 1 theorem, 33 equations, 5 figures, 17 tables, 1 algorithm.

Key Result

Lemma A.1

For any $v\in\mathcal{C}_i$,

Figures (5)

  • Figure 1: Comparison of optimization-based jailbreak attacks. GCG and MAC can result in refusals, while $\mathcal{I}$-GCG reduces refusals but produces pseudo-harmful outputs. Our TAO-Attack employs a two-stage loss to suppress refusals and penalize pseudo-harmful outputs, leading to more effective harmful completions.
  • Figure 2: Illustration of the token optimization. GCG prefers $\mathbf{e}_j$ due to its large step size, even though it deviates from the negative gradient direction (red arrow). Our method instead selects $\mathbf{e}_l$, which achieves both strong alignment with the negative gradient and a sufficient step size.
  • Figure 3: Average loss for GCG (Softmax) and DPTO, with shaded areas indicating standard deviation, computed over samples where both methods succeed.
  • Figure 4: Evaluation prompt used for the GPT-4-based check (GPT-4 Turbo).
  • Figure 5: The universal jailbreaking suffix prompts response generation from GPT-3.5 Turbo (left) and GPT-4 Turbo (right) on the OpenRouter platform.
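
The selection order described in the Figure 2 caption (direction first, step size second) can be illustrated with a toy geometric sketch. Everything in it is an assumption made for illustration: the 2-D vectors, the function names, the `keep` parameter, and the use of cosine similarity followed by projected step length are not taken from the paper, and `magnitude_first` is only a stand-in for the step-size-dominated ranking that the caption attributes to GCG.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def displacement(current, candidate):
    # Vector from the current embedding to a candidate embedding.
    return [c - x for c, x in zip(candidate, current)]

def projected_step(current, grad, candidate):
    # First-order predicted loss decrease: -g . (e_v - e_i).
    return -dot(grad, displacement(current, candidate))

def alignment(current, grad, candidate):
    # Cosine between the displacement and the negative gradient.
    d = displacement(current, candidate)
    denom = math.sqrt(dot(d, d)) * math.sqrt(dot(grad, grad))
    return projected_step(current, grad, candidate) / denom if denom else 0.0

def magnitude_first(current, grad, candidates):
    # Stand-in baseline (assumption): rank by projected step alone,
    # so a long, poorly aligned step can win.
    return max(candidates,
               key=lambda n: projected_step(current, grad, candidates[n]))

def direction_first(current, grad, candidates, keep=2):
    # Hypothetical direction-priority rule: keep the `keep` best-aligned
    # candidates, then choose the largest projected step among them.
    aligned = sorted(candidates,
                     key=lambda n: alignment(current, grad, candidates[n]),
                     reverse=True)[:keep]
    return max(aligned,
               key=lambda n: projected_step(current, grad, candidates[n]))

# Toy data (made up): the negative gradient points along +x.
current = [0.0, 0.0]
grad = [-1.0, 0.0]
candidates = {
    "e_j": [3.0, 4.0],  # long step, cosine 0.6 with -grad
    "e_l": [2.0, 0.2],  # well aligned, moderate step
    "e_k": [0.1, 0.0],  # perfectly aligned, tiny step
}

print(magnitude_first(current, grad, candidates))  # e_j
print(direction_first(current, grad, candidates))  # e_l
```

In this toy setup the magnitude-only ranking picks `e_j`, while filtering by alignment first and then ranking by step size picks `e_l`, mirroring the contrast drawn in Figure 2. Setting `keep` to the full candidate count removes the filter and recovers the baseline choice.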

Theorems & Definitions (1)

  • Lemma A.1: Cone constraint