TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models

Zhi Xu; Jiaqi Li; Xiaotong Zhang; Hong Yu; Han Liu

TL;DR

TAO-Attack is a new optimization-based jailbreak method for large language models. In the reported experiments it consistently outperforms state-of-the-art methods, achieving higher attack success rates and reaching 100% in certain scenarios.

Abstract

Large language models (LLMs) have achieved remarkable success across diverse applications but remain vulnerable to jailbreak attacks, where attackers craft prompts that bypass safety alignment and elicit unsafe responses. Among existing approaches, optimization-based attacks have shown strong effectiveness, yet current methods often suffer from frequent refusals, pseudo-harmful outputs, and inefficient token-level updates. In this work, we propose TAO-Attack, a new optimization-based jailbreak method. TAO-Attack employs a two-stage loss function: the first stage suppresses refusals to ensure the model continues harmful prefixes, while the second stage penalizes pseudo-harmful outputs and encourages the model toward more harmful completions. In addition, we design a direction-priority token optimization (DPTO) strategy that improves efficiency by aligning candidates with the gradient direction before considering update magnitude. Extensive experiments on multiple LLMs demonstrate that TAO-Attack consistently outperforms state-of-the-art methods, achieving higher attack success rates and even reaching 100% in certain scenarios.

Paper Structure

This paper contains 50 sections, 1 theorem, 33 equations, 5 figures, 17 tables, and 1 algorithm.

Introduction
Related Work
Methodology
Problem Formulation
Two-Stage Loss Function
Stage One: Refusal-Aware Loss
Stage Two: Effectiveness-Aware Loss
Final Loss Function
Direction-Priority Token Optimization
Rethinking GCG
The DPTO Strategy
Experiments
Experimental Settings
Datasets
Models
...and 35 more sections

Key Result

Lemma A.1

For any $v\in\mathcal{C}_i$,

Figures (5)

Figure 1: Comparison of optimization-based jailbreak attacks. GCG and MAC can result in refusals, while $\mathcal{I}$-GCG reduces refusals but produces pseudo-harmful outputs. Our TAO-Attack employs a two-stage loss to suppress refusals and penalize pseudo-harmful outputs, leading to more effective harmful completions.
Figure 2: Illustration of the token optimization. GCG prefers $\mathbf{e}_j$ due to its large step size, even though it deviates from the negative gradient direction (red arrow). Our method instead selects $\mathbf{e}_l$, which achieves both strong alignment with the negative gradient and a sufficient step size.
Figure 3: Average loss for GCG (Softmax) and DPTO, with shaded areas indicating standard deviation, computed over samples where both methods succeed.
Figure 4: Evaluation prompt for GPT-4-based (GPT-4 Turbo) checking.
Figure 5: The universal jailbreaking suffix prompts response generation from GPT-3.5 Turbo (left) and GPT-4 Turbo (right) on the OpenRouter platform.
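
The geometric contrast described in the Figure 2 caption can be illustrated with a toy example. The sketch below is not the paper's algorithm: the function name `rank_candidates`, the `keep` parameter, the 2-D vectors, and the exact two-step rule (filter by cosine alignment, then pick the largest projected step) are all assumptions made for illustration. It only shows how a pure dot-product score can favor a large but poorly aligned displacement, while a direction-first rule picks a well-aligned one.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def rank_candidates(current, candidates, grad, keep=2):
    """Toy comparison of two selection rules (hypothetical helper).

    Returns (dot_choice, direction_choice) as indices into `candidates`.
    - dot_choice: largest projection of the displacement onto -grad.
    - direction_choice: keep the `keep` best cosine-aligned candidates,
      then take the largest projection among them.
    """
    descent = [-g for g in grad]
    stats = []  # (index, projected step, cosine with descent direction)
    for idx, emb in enumerate(candidates):
        disp = [c - e for c, e in zip(emb, current)]
        length = norm(disp)
        if length == 0.0:
            continue  # a candidate identical to the current point is skipped
        projected = dot(disp, descent)
        cosine = projected / (length * norm(descent))
        stats.append((idx, projected, cosine))

    dot_choice = max(stats, key=lambda s: s[1])[0]
    aligned = sorted(stats, key=lambda s: s[2], reverse=True)[:keep]
    direction_choice = max(aligned, key=lambda s: s[1])[0]
    return dot_choice, direction_choice

# Assumed toy data: the descent direction is +x.
current = (0.0, 0.0)
grad = (-1.0, 0.0)
candidates = [
    (3.0, 4.0),  # large step, poorly aligned (plays the role of e_j)
    (2.0, 0.2),  # well aligned, sufficient step (plays the role of e_l)
    (0.1, 0.0),  # perfectly aligned, negligible step
]
dot_choice, direction_choice = rank_candidates(current, candidates, grad)
print(dot_choice, direction_choice)
```

In this toy setup the dot-product rule selects the first candidate and the direction-first rule selects the second, mirroring the $\mathbf{e}_j$ versus $\mathbf{e}_l$ contrast in the caption.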

Theorems & Definitions (1)

Lemma A.1: Cone constraint
