Understanding and Enhancing the Transferability of Jailbreaking Attacks

Runqi Lin, Bo Han, Fengwang Li, Tongliang Liu

TL;DR

This work investigates the transferability of jailbreaking attacks on large language models (LLMs) through the lens of intent perception. It shows that adversarial sequences can mislead the source model, yet their effect often fails to transfer to proprietary target LLMs because of a distributional dependency: the sequences are effective only because they overfit the source model's parameters. To address this, the authors propose Perceived-importance Flatten (PiF), which uniformly disperses the model's focus across neutral-intent tokens in the original input, uses a dynamic optimization objective, and relies on synonym substitutions rather than lengthy adversarial sequences. PiF achieves high attack success rates (ASR) on diverse models (up to ~97% on GPT-4) and reduces transfer-related variance, offering a practical red-teaming evaluation method for identifying vulnerabilities in proprietary LLMs. The study highlights implications for model alignment, safety defenses, and future cross-domain red-teaming across multimodal models.

Abstract

Jailbreaking attacks can effectively manipulate open-source large language models (LLMs) to produce harmful responses. However, these attacks exhibit limited transferability, failing to disrupt proprietary LLMs consistently. To reliably identify vulnerabilities in proprietary LLMs, this work investigates the transferability of jailbreaking attacks by analysing their impact on the model's intent perception. By incorporating adversarial sequences, these attacks can redirect the source LLM's focus away from malicious-intent tokens in the original input, thereby obstructing the model's intent recognition and eliciting harmful responses. Nevertheless, these adversarial sequences fail to mislead the target LLM's intent perception, allowing the target LLM to refocus on malicious-intent tokens and abstain from responding. Our analysis further reveals the inherent distributional dependency within the generated adversarial sequences, whose effectiveness stems from overfitting the source LLM's parameters, resulting in limited transferability to target LLMs. To this end, we propose the Perceived-importance Flatten (PiF) method, which uniformly disperses the model's focus across neutral-intent tokens in the original input, thus obscuring malicious-intent tokens without relying on overfitted adversarial sequences. Extensive experiments demonstrate that PiF provides an effective and efficient red-teaming evaluation for proprietary LLMs.

Paper Structure

This paper contains 25 sections, 1 equation, 4 figures, 14 tables, and 1 algorithm.

Figures (4)

  • Figure 1: The effectiveness of jailbreaking attacks. These attacks are initially generated on the source LLM (Llama-2-7B-Chat) and subsequently transferred to the target LLM (Llama-2-13B-Chat). For token-level and prompt-level jailbreaks, we employ the GCG and PAIR attacks as baseline methods.
  • Figure 2: The model's intent perception on the original input, as well as GCG and PAIR attacks. Unaligned perceived-importance (PI) is assessed on Llama-2-7B. Source and target PI are measured on Llama-2-7B-Chat and Llama-2-13B-Chat, respectively.
  • Figure 3: The model's intent perception on the swapped-order GCG and PAIR attacks. The source perceived-importance (PI) is measured on the Llama-2-7B-Chat.
  • Figure 4: The procedure of the Perceived-importance Flatten (PiF) method. Source and target perceived-importance (PI) are evaluated on Bert-Large and Llama-2-13B-Chat, respectively.