SafeBehavior: Simulating Human-Like Multistage Reasoning to Mitigate Jailbreak Attacks in Large Language Models

Qinjian Zhao, Jiaqi Wang, Zhiqiang Gao, Zhihao Dou, Belal Abuhaija, Kaizhu Huang

TL;DR

SafeBehavior tackles jailbreak risks in large language models by simulating human-like multistage reasoning at inference time. It integrates three stages (intention inference, self-introspection, and self-revision) with a continuous jailbreak confidence $S_r$ and a threshold $\tau$ to decide whether to refuse, accept, or revise a response. Across five attack types and two base models, SafeBehavior achieves a near-zero attack success rate (ASR) and a zero false positive rate (FPR) while preserving reasoning capabilities and maintaining efficiency, outperforming seven baselines. The work demonstrates that a hierarchical, adaptive defense can robustly counter diverse jailbreak strategies and is practical to deploy.
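
For readers unfamiliar with the two headline metrics, the sketch below computes attack success rate (ASR) and false positive rate (FPR) under their conventional definitions: ASR is the fraction of jailbreak prompts that still elicit a harmful response, and FPR is the fraction of benign prompts that the defense wrongly refuses. This summary does not say how the paper judges an individual response as harmful or refused, so the boolean inputs and both function names are assumptions made for illustration.

```python
from typing import Sequence


def attack_success_rate(attack_succeeded: Sequence[bool]) -> float:
    """Fraction of jailbreak prompts whose attack succeeded (lower is better).

    Each entry is True when the defended model still produced a harmful
    response. How that judgment is made is outside this sketch.
    """
    if not attack_succeeded:
        raise ValueError("need at least one attack prompt")
    return sum(attack_succeeded) / len(attack_succeeded)


def false_positive_rate(benign_refused: Sequence[bool]) -> float:
    """Fraction of benign prompts that were wrongly refused (lower is better)."""
    if not benign_refused:
        raise ValueError("need at least one benign prompt")
    return sum(benign_refused) / len(benign_refused)


if __name__ == "__main__":
    # Made-up outcomes, purely to show the arithmetic.
    print(attack_success_rate([True, False, False, False]))
    print(false_positive_rate([False, False, False]))
```

A defense can reach a low ASR trivially by refusing everything, which is why the two numbers are reported together: FPR captures the cost of over-refusal on legitimate requests.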

Abstract

Large Language Models (LLMs) have achieved impressive performance across diverse natural language processing tasks, but their growing power also amplifies potential risks such as jailbreak attacks that circumvent built-in safety mechanisms. Existing defenses, including input paraphrasing, multi-step evaluation, and safety expert models, often suffer from high computational costs, limited generalization, or rigid workflows that fail to detect subtle malicious intent embedded in complex contexts. Inspired by cognitive science findings on human decision-making, we propose SafeBehavior, a novel hierarchical jailbreak defense mechanism that simulates the adaptive multistage reasoning process of humans. SafeBehavior decomposes safety evaluation into three stages: intention inference to detect obvious input risks, self-introspection to assess generated responses and assign confidence-based judgments, and self-revision to adaptively rewrite uncertain outputs while preserving user intent and enforcing safety constraints. We extensively evaluate SafeBehavior against five representative jailbreak attack types, including optimization-based, contextual manipulation, and prompt-based attacks, and compare it with seven state-of-the-art defense baselines. Experimental results show that SafeBehavior significantly improves robustness and adaptability across diverse threat scenarios, offering an efficient and human-inspired approach to safeguarding LLMs against jailbreak attempts.
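
The three-stage flow described above can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: the four callables (`flags_intent`, `generate`, `introspect`, `revise`) stand in for model calls, and the mapping from the jailbreak confidence $S_r$ to an action is an assumption. Here, $S_r \le \tau$ accepts the draft, $S_r \ge 1 - \tau$ refuses, and anything in between is revised. The default `tau=0.3` is likewise a placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Optional

REFUSAL = "I can't help with that request."


@dataclass
class Decision:
    action: str                  # "refuse", "accept", or "revise"
    response: str
    confidence: Optional[float]  # S_r; None when refused at stage 1


def decide(s_r: float, tau: float) -> str:
    """Map jailbreak confidence S_r to an action.

    ASSUMPTION: two cutoffs, tau and 1 - tau. The summary only says that
    S_r and tau together select refusal, acceptance, or revision.
    """
    if not 0.0 <= s_r <= 1.0:
        raise ValueError("s_r must lie in [0, 1]")
    if not 0.0 < tau < 0.5:
        raise ValueError("tau must lie in (0, 0.5) for the bands to be ordered")
    if s_r <= tau:
        return "accept"
    if s_r >= 1.0 - tau:
        return "refuse"
    return "revise"


def safe_respond(
    prompt: str,
    *,
    flags_intent: Callable[[str], bool],      # hypothetical stage-1 check
    generate: Callable[[str], str],           # hypothetical base-model call
    introspect: Callable[[str, str], float],  # hypothetical scorer returning S_r
    revise: Callable[[str, str], str],        # hypothetical stage-3 rewriter
    tau: float = 0.3,
) -> Decision:
    # Stage 1: intention inference rejects obviously risky inputs up front.
    if flags_intent(prompt):
        return Decision("refuse", REFUSAL, None)

    # Stage 2: self-introspection scores the drafted response.
    draft = generate(prompt)
    s_r = introspect(prompt, draft)
    action = decide(s_r, tau)

    if action == "accept":
        return Decision(action, draft, s_r)
    if action == "refuse":
        return Decision(action, REFUSAL, s_r)

    # Stage 3: self-revision rewrites uncertain drafts instead of discarding them.
    return Decision(action, revise(prompt, draft), s_r)
```

Only inputs that pass the first check incur a generation call, and only drafts in the uncertain band incur a rewrite, which is one way a staged design can keep average cost low.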

Paper Structure

This paper contains 24 sections, 6 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: ASR of existing defenses under jailbreak attacks, showing the limitations of input-only methods and motivating multistage reasoning defenses.
  • Figure 2: Overview of SafeBehavior, which integrates intention inference, self-introspection, and self-revision to simulate human-like reasoning and defend against jailbreak attacks.
  • Figure 3: Computation time of different defense methods based on Qwen2.5-Instruct.
  • Figure 4: Computation time of different defense methods based on Mistral-7B-Instruct.
  • Figure 5: Sensitivity analysis of $\tau$.