Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks

Hanjiang Hu, Alexander Robey, Changliu Liu

TL;DR

This work proposes a safety steering framework grounded in safe control theory that achieves invariant safety at each turn of dialogue by learning a safety predictor that accounts for adversarial queries, preventing potential context drift toward jailbreaks.

Abstract

Large language models (LLMs) are shown to be vulnerable to jailbreaking attacks where adversarial prompts are designed to elicit harmful responses. While existing defenses effectively mitigate single-turn attacks by detecting and filtering unsafe inputs, they fail against multi-turn jailbreaks that exploit contextual drift over multiple interactions, gradually leading LLMs away from safe behavior. To address this challenge, we propose a safety steering framework grounded in safe control theory, ensuring invariant safety in multi-turn dialogues. Our approach models the dialogue with LLMs using state-space representations and introduces a novel neural barrier function (NBF) to detect and filter harmful queries emerging from evolving contexts proactively. Our method achieves invariant safety at each turn of dialogue by learning a safety predictor that accounts for adversarial queries, preventing potential context drift toward jailbreaks. Extensive experiments under multiple LLMs show that our NBF-based safety steering outperforms safety alignment, prompt-based steering and lightweight LLM guardrails baselines, offering stronger defenses against multi-turn jailbreaks while maintaining a better trade-off among safety, helpfulness and over-refusal. Check out the website here https://sites.google.com/view/llm-nbf/home.
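The filtering idea in the abstract can be illustrated with a toy sketch. The code below is not the authors' implementation: the dynamics `toy_dynamics`, the barrier `toy_barrier`, the constant `SAFE_RADIUS`, and the sign convention (negative barrier value means safe) are all hypothetical stand-ins chosen for illustration, whereas the paper learns both the dialogue dynamics and the barrier function with neural networks. The sketch only shows the general pattern: predict the next dialogue state from the current state and an incoming query embedding, then reject the query if the predicted state leaves the assumed safe set.

```python
import math
from typing import List, Tuple

Vector = List[float]

# Assumption: the safe set is a ball of this radius in a toy state space.
SAFE_RADIUS = 0.4


def toy_dynamics(state: Vector, query: Vector) -> Vector:
    """Hypothetical stand-in for learned dialogue dynamics x_{k+1} = f(x_k, u_k)."""
    return [math.tanh(0.8 * x + 0.5 * u) for x, u in zip(state, query)]


def toy_barrier(state: Vector) -> float:
    """Hypothetical barrier: negative inside the assumed safe ball, non-negative outside."""
    return sum(x * x for x in state) - SAFE_RADIUS ** 2


def filter_query(state: Vector, query: Vector) -> Tuple[bool, Vector]:
    """Accept the query only if the predicted next state stays in the safe set."""
    predicted = toy_dynamics(state, query)
    if toy_barrier(predicted) < 0.0:
        return True, predicted
    # Rejected: the dialogue state is left unchanged.
    return False, state


def run_dialogue(queries: List[Vector]) -> List[bool]:
    """Apply the filter turn by turn, starting from a zero state."""
    state: Vector = [0.0] * len(queries[0])
    decisions: List[bool] = []
    for query in queries:
        accepted, state = filter_query(state, query)
        decisions.append(accepted)
    return decisions


if __name__ == "__main__":
    # The same mild query repeated: accepted early, rejected once the context has drifted.
    print(run_dialogue([[0.4, 0.0]] * 4))
```

In this toy setting, a query that is acceptable in isolation is rejected once the accumulated state has drifted close to the boundary. This mirrors the multi-turn setting the abstract describes, where the filter must account for the evolving context rather than judging each query alone.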

Paper Structure

This paper contains 41 sections, 5 theorems, 28 equations, 11 figures, 15 tables.

Key Result

Theorem 4.2

Given the neural dialogue dynamics and the query embeddings $u_k$, $k=1,2,\dots,K$, the LLM is invariantly safe according to Definition 3.1 if the following inequality conditions hold, where $\phi_k$ is the NBF in Definition 4.1 with query context embedding set $\mathcal{U}_{k-1}$.

Figures (11)

  • Figure 1: Overview of safety steering based on neural dialogue dynamics and barrier function.
  • Figure 2: Single-turn vs multi-turn jailbreaks. Queries shift from harmless to harmful.
  • Figure 3: Conversation in the language space and state-space representations in the hidden state and embedding space. Queries shift from harmless to harmful.
  • Figure 4: Multi-turn jailbreaking conversations with and without NBF-based safety steering. Queries shift from harmless to harmful.
  • Figure 5: Trade-off between attack success rate (lower better) by ActorAttack and MTBench helpfulness (higher better) on Llama-3-8b-instruct and Phi-4. The blue line indicates the Pareto front.
  • ...and 6 more figures

Theorems & Definitions (12)

  • Definition 3.1: Invariant Safety in Multi-turn Conversation
  • Definition 4.1: Neural Barrier Function for Multi-turn Dialogue Dynamics
  • Theorem 4.2: Invariant Safety Certificate based on Neural Barrier Function
  • Corollary 4.2.1
  • Definition A.1: Invariant Safety in Multi-turn Conversation
  • Definition A.2: Neural Barrier Function for Multi-turn Dialogue Dynamics
  • Lemma A.3
  • proof
  • Theorem A.4: Invariant Safety Certificate based on Neural Barrier Function
  • proof
  • ...and 2 more