Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning

Xianglin Yang, Gelei Deng, Jieming Shi, Tianwei Zhang, Jin Song Dong

TL;DR

This work tackles the vulnerability of large language models to jailbreak prompts by introducing Safety Chain-of-Thought (SCoT), a proactive, reasoning-based defense. SCoT uses a three-stage process (verify intent, apply safety reasoning, and respond with a structured refusal) to assess harmful inputs before answering, coupled with question evolution and supervised fine-tuning. Experimental results show near-zero jailbreak attack success rates across multiple attack types and datasets, with only modest trade-offs in general capabilities compared to existing defenses like Circuitbreaker. The approach demonstrates improved robustness to out-of-distribution and adversarial inputs, confirms the value of proactive safety reasoning, and provides open resources to foster further safety research and development.

Abstract

Large language models (LLMs) are vital for a wide range of applications yet remain susceptible to jailbreak threats, which can lead to the generation of inappropriate responses. Conventional defenses, such as refusal and adversarial training, often fail to cover corner cases or rare domains, leaving LLMs vulnerable to more sophisticated attacks. We propose a novel defense strategy, Safety Chain-of-Thought (SCoT), which harnesses the enhanced reasoning capabilities of LLMs for proactive assessment of harmful inputs, rather than simply blocking them. SCoT augments any refusal training dataset so that the model critically analyzes the intent behind each request before generating an answer. By employing proactive reasoning, SCoT enhances the generalization of LLMs across varied harmful queries and scenarios not covered in the safety alignment corpus. Additionally, it generates detailed refusals specifying the rules violated. Comparative evaluations show that SCoT significantly surpasses existing defenses, reducing vulnerability to out-of-distribution issues and adversarial manipulations while maintaining strong general capabilities.
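
As a rough illustration of what "augmenting a refusal training dataset" with proactive reasoning could look like, the sketch below builds a training target that analyzes intent first, then states the safety reasoning, and only then responds. Everything in it is an assumption made for illustration: the `RefusalExample` schema, the `build_scot_target` helper, the bracketed section labels, and the refusal wording are hypothetical and are not taken from the paper's actual template or released data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RefusalExample:
    """One record from a refusal-training dataset (hypothetical schema)."""
    prompt: str
    violated_rules: List[str]  # empty list means the request is judged benign
    benign_answer: str = ""    # used only when no rule is violated


def build_scot_target(example: RefusalExample) -> str:
    """Compose a target that reasons about intent before answering.

    Section labels and wording are illustrative assumptions, not the
    paper's template.
    """
    lines = ["[Intent analysis]"]
    if example.violated_rules:
        lines.append("The request appears to seek harmful assistance.")
        lines.append("[Safety reasoning]")
        for rule in example.violated_rules:
            lines.append(f"- Violates: {rule}")
        lines.append("[Response]")
        # Detailed refusal that names the violated rules.
        lines.append(
            "I cannot help with this request because it violates: "
            + ", ".join(example.violated_rules)
            + "."
        )
    else:
        lines.append("The request appears benign.")
        lines.append("[Safety reasoning]")
        lines.append("- No rule violated.")
        lines.append("[Response]")
        lines.append(example.benign_answer)
    return "\n".join(lines)


if __name__ == "__main__":
    record = RefusalExample(
        prompt="<a harmful request>",
        violated_rules=["illegal activity"],
    )
    print(build_scot_target(record))
```

The point of the ordering is that the refusal is produced after, and conditioned on, an explicit intent assessment, rather than being triggered by surface features of the prompt.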

Paper Structure

This paper contains 23 sections, 1 equation, 3 figures, 13 tables, 1 algorithm.

Figures (3)

  • Figure 1: A comparison of our Safety Chain-of-Thought (SCoT) defense with conventional safety-aligned defenses against the suppress-refusal attack. The conventional safety-aligned model follows the instruction to avoid outputting refusal words and is therefore jailbroken. In contrast, SCoT proactively assesses the harmful intent of the request and successfully defends against the attack.
  • Figure 2: An overview of the Safety Chain-of-Thought (SCoT) methodology.
  • Figure 3: An example of evolved questions with slang and uncommon dialect styles.