Dynamic Target Attack

Kedong Xiu, Churui Zeng, Tianhang Zheng, Xinzhe Huang, Xiaojun Jia, Di Wang, Puning Zhao, Zhan Qin, Kui Ren

TL;DR

Dynamic Target Attack (DTA) tackles the problem that fixed-target jailbreaks suffer from large target–distribution gaps in safety-aligned LLMs. By dynamically sampling harmful targets directly from the target model’s distribution under relaxed decoding and then briefly optimizing an adversarial suffix toward the chosen target, DTA reduces the discrepancy and speeds up convergence. The approach demonstrates strong gains in both white-box and black-box settings across multiple benchmarks, achieving higher attack success rates with substantially lower computational cost compared to prior methods. This method advances practical understanding of LLM jailbreak dynamics and offers a framework for evaluating and improving safety defenses, with open-source resources to replicate and extend the experiments.

Abstract

Existing gradient-based jailbreak attacks typically optimize an adversarial suffix to induce a fixed affirmative response. However, this fixed target usually resides in an extremely low-density region of a safety-aligned LLM's output distribution conditioned on diverse harmful inputs. Because of the substantial discrepancy between the target and the original output, existing attacks require numerous iterations to optimize the adversarial prompt, and may still fail to induce the low-probability target response from the target LLM. In this paper, we propose Dynamic Target Attack (DTA), a new jailbreaking framework that uses the target LLM's own responses as targets for optimizing the adversarial prompts. In each optimization round, DTA samples multiple candidate responses directly from the output distribution conditioned on the current prompt, and selects the most harmful response as a temporary target for prompt optimization. In contrast to existing attacks, DTA significantly reduces the discrepancy between the target and the output distribution, which substantially eases the search for an effective adversarial prompt. Extensive experiments demonstrate the superior effectiveness and efficiency of DTA: under the white-box setting, DTA needs only 200 optimization iterations to achieve an average attack success rate (ASR) of over 87% on recent safety-aligned LLMs, exceeding the state-of-the-art baselines by over 15%. The time cost of DTA is 2 to 26 times lower than that of existing baselines. Under the black-box setting, DTA uses Llama-3-8B-Instruct as a surrogate model for target sampling and achieves an ASR of 85% against the black-box target model Llama-3-70B-Instruct, exceeding its counterparts by over 25%.

Paper Structure

This paper contains 28 sections, 13 equations, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: DTA directly samples $T_{\text{sampled}}$ from the LLM, which is more probable than $T_{\text{fixed}}$.
  • Figure 2: Overview of DTA. DTA repeatedly executes a "sampling-optimization cycle": it samples an inherently harmful response from the target LLM's relatively high-probability generation regions, then optimizes the adversarial suffix toward it. Algorithm 1 gives the details of DTA.
  • Figure 3: Illustration of the core difference between DTA and existing methods.
  • Figure 4: Comparison of DTA and baselines on HarmBench. Dark (light) bars denote the average (maximum) ASR across five target LLMs.
  • Figure 5: Comparison of DTA and baselines on AdvBench. Dark (light) bars denote the average (maximum) ASR across five target LLMs.
  • ...and 2 more figures