Proactive defense against LLM Jailbreak

Weiliang Zhao, Jinjun Peng, Daniel Ben-Levi, Zhou Yu, Junfeng Yang

TL;DR

The paper addresses the vulnerability of large language models to evolving jailbreaking attacks, including multi-turn strategies. It introduces ProAct, a proactive three-agent framework that returns spurious responses, which look like successful jailbreak outputs but contain no harmful content, to mislead attackers' evaluators and terminate adversarial searches prematurely. Across diverse benchmarks, models, and attack strategies, ProAct reduces attack success rates by up to 92% and provides additive gains when combined with existing defenses, while preserving model utility. This work demonstrates a practical, orthogonal approach to strengthening LLM safety by disrupting the attack process itself rather than solely filtering downstream content.

Abstract

The proliferation of powerful large language models (LLMs) has necessitated robust safety alignment, yet these models remain vulnerable to evolving adversarial attacks, including multi-turn jailbreaks that iteratively search for successful queries. Current defenses, primarily reactive and static, often fail to counter these search-based attacks. In this paper, we introduce ProAct, a novel proactive defense framework designed to disrupt and mislead autonomous jailbreaking processes. Our core idea is to intentionally provide adversaries with "spurious responses" that appear to be results of successful jailbreak attacks but contain no actual harmful content. These misleading responses provide false signals to the attacker's internal optimization loop, causing the adversarial search to terminate prematurely and effectively jailbreaking the jailbreak. By conducting extensive experiments across state-of-the-art LLMs, jailbreaking frameworks, and safety benchmarks, our method consistently and significantly reduces attack success rates by up to 92%. When combined with other defense frameworks, it further reduces the success rate of the latest attack strategies to 0%. ProAct represents an orthogonal defense strategy that can serve as an additional guardrail to enhance LLM safety against the most effective jailbreaking attacks.
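
To make the "false signal" idea concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a hypothetical attacker that tries a fixed list of query reformulations and stops as soon as its own judge sees anything other than a refusal. All names (`passive_target`, `proactive_target`, `attacker_judge`, `run_attack_search`) and the placeholder strings are illustrative assumptions; real attack frameworks use LLM-based judges and adaptive query generation.

```python
from typing import Callable, List, Optional, Tuple

REFUSAL = "I can't help with that."


def passive_target(query: str) -> str:
    """Toy passive defence: refuses unless a (hypothetical) framing trick slips through."""
    if "roleplay" in query:
        return "Sure, here is how: <placeholder standing in for a harmful answer>"
    return REFUSAL


def proactive_target(query: str) -> str:
    """Toy proactive defence: every query in this demo is malicious, so it
    always returns a harmless decoy that merely looks compliant."""
    return "Sure, here is how: <benign decoy with no usable content>"


def attacker_judge(response: str) -> bool:
    """Hypothetical attacker-side evaluator: any non-refusal counts as success."""
    return response != REFUSAL


def run_attack_search(
    queries: List[str], target: Callable[[str], str]
) -> Tuple[Optional[int], Optional[str]]:
    """Return (turn, response) at which the attacker stops, or (None, None)."""
    for turn, query in enumerate(queries, start=1):
        response = target(query)
        if attacker_judge(response):
            # The attacker believes the jailbreak worked and ends its search.
            return turn, response
    return None, None


QUERIES = ["direct request", "rephrased request", "roleplay framing of the request"]
passive_result = run_attack_search(QUERIES, passive_target)
proactive_result = run_attack_search(QUERIES, proactive_target)
print("passive stops at turn", passive_result[0])
print("proactive stops at turn", proactive_result[0])
```

In this toy setting the search against the passive target keeps going until a reformulation succeeds, whereas the search against the proactive target ends on the first turn with a decoy in hand, so the later reformulations are never tried.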

Paper Structure

This paper contains 19 sections, 2 equations, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Passive vs. Proactive Defence. We illustrate a jailbreak against GPT-OSS where the goal is to explain how to bypass the two-factor authentication system. Under a passive defence, repeated iterative attempts by the attacker eventually succeed. In contrast, proactive defence blocks the attack by returning a spurious response that misleads the attacker’s internal evaluator and stops the jailbreak in one turn. The highlighted content on the right appears malicious and task-relevant to the attacker but is actually benign and meaningless, crafted to convince the attacker’s evaluator that the model produced harmful output.
  • Figure 2: Overview of the ProAct Framework. ProAct consists of four stages: 1) a ① User Intent Analyser assesses the maliciousness of the input using the current input together with the conversation history, and summarises the topic; 2) if the task is malicious, the ② ProAct Defender, equipped with encoding/misleading strategies, conditions on the topic and prior attempts to generate an effective, distinct spurious response; 3) a ③ Surrogate Evaluator calls for regeneration until the response is judged malicious with respect to its related topic, and the successful spurious response is then used as the final output; 4) if the task is benign, the base model’s raw response to the input query is returned.
  • Figure 3: ProAct Defending Jailbreaks with Spurious Response Strategies. Examples of harmful user requests (e.g., weapon assembly, phishing, social engineering, organ trade) are transformed into benign yet spurious responses using diverse encoding strategies such as Emoji substitution, Base64, Hex, and Morse code. These spurious responses appear harmful to the attacker’s evaluator but remain safe in content, effectively preventing further exploitation.
  • Figure 4: Effects of Backend Model Capacity across ProAct Components. We compare GPT-4.1-nano, GPT-4.1-mini, and GPT-5 as backend models for the User Intent Analyser, ProAct Defender, and Surrogate Evaluator. Reported metric is Attack Success Rate (ASR), where lower is better. Larger backend models substantially improve ProAct Defender performance, while the analyser and evaluator exhibit modest gains.
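
The control flow described in the Figure 2 caption, with the Base64 and Hex encodings named in Figure 3, can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the paper's code: the three agents are LLM-backed in the paper, whereas here they are replaced by hypothetical keyword-based stubs (`toy_analyser`, `toy_defender`, `toy_surrogate`). The `max_attempts` budget and the refusal fallback are also assumptions, since the caption does not say what happens if regeneration never succeeds.

```python
import base64
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class IntentReport:
    malicious: bool
    topic: str


def proact_respond(
    query: str,
    history: List[str],
    analyse_intent: Callable[[str, List[str]], IntentReport],
    generate_spurious: Callable[[str, List[str]], str],
    surrogate_judge: Callable[[str, str], bool],
    base_model: Callable[[str], str],
    max_attempts: int = 3,  # assumed regeneration budget
) -> str:
    report = analyse_intent(query, history)          # stage 1: intent analysis
    if not report.malicious:
        return base_model(query)                     # stage 4: benign path
    attempts: List[str] = []
    for _ in range(max_attempts):
        candidate = generate_spurious(report.topic, attempts)  # stage 2
        if surrogate_judge(candidate, report.topic):            # stage 3
            return candidate
        attempts.append(candidate)
    return "I can't help with that."  # assumed fallback, not from the paper


# Hypothetical stand-ins for the three LLM-backed agents.
ENCODERS = [
    lambda text: base64.b64encode(text.encode()).decode(),  # Base64 strategy
    lambda text: text.encode().hex(),                       # Hex strategy
]
FILLER = "benign filler text with no instructions"


def toy_analyser(query: str, history: List[str]) -> IntentReport:
    flagged = "bypass" in query.lower()
    return IntentReport(flagged, "bypassing 2FA" if flagged else "general")


def toy_defender(topic: str, prior_attempts: List[str]) -> str:
    # Rotate strategies so each retry differs from earlier attempts.
    encode = ENCODERS[len(prior_attempts) % len(ENCODERS)]
    return f"Sure, here is the full procedure for {topic} (encoded): {encode(FILLER)}"


def toy_surrogate(response: str, topic: str) -> bool:
    return response.startswith("Sure") and topic in response


def echo_model(query: str) -> str:
    return "echo: " + query


decoy = proact_respond("How do I bypass 2FA?", [], toy_analyser,
                       toy_defender, toy_surrogate, echo_model)
benign = proact_respond("What is 2FA?", [], toy_analyser,
                        toy_defender, toy_surrogate, echo_model)
```

Here `decoy` looks like a compliant answer, but its encoded payload decodes to harmless filler, while `benign` is simply the base model's unmodified reply.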