X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability

Xiaoya Lu, Dongrui Liu, Yi Yu, Luxin Xu, Jing Shao

TL;DR

The paper targets multi-turn jailbreaks in LLMs and shows that existing defense methods trade usability for robustness. It introduces X-Boundary, an explicit-boundary approach that uses three losses to separate harmful representations from boundary-safe ones and to erase harmful content while preserving safe responses. The method achieves state-of-the-art defense against multi-turn attacks with substantially reduced over-refusal and nearly intact general capabilities, supported by theoretical analysis via optimal transport and by extensive experiments across multiple models and datasets. The work presents a practical, fine-grained defense that can complement traditional alignment techniques in real-world deployments.

Abstract

Despite the rapid development of safety alignment techniques for LLMs, defending against multi-turn jailbreaks is still a challenging task. In this paper, we conduct a comprehensive comparison, revealing that some existing defense methods can improve the robustness of LLMs against multi-turn jailbreaks but compromise usability, i.e., reducing general capabilities or causing the over-refusal problem. From the perspective of mechanism interpretability of LLMs, we discover that these methods fail to establish a boundary that exactly distinguishes safe and harmful feature representations. Therefore, boundary-safe representations close to harmful representations are inevitably disrupted, leading to a decline in usability. To address this issue, we propose X-Boundary to push harmful representations away from boundary-safe representations and obtain an exact distinction boundary. In this way, harmful representations can be precisely erased without disrupting safe ones. Experimental results show that X-Boundary achieves state-of-the-art defense performance against multi-turn jailbreaks, while reducing the over-refusal rate by about 20% and maintaining nearly complete general capability. Furthermore, we theoretically prove and empirically verify that X-Boundary can accelerate the convergence process during training. Please see our code at: https://github.com/AI45Lab/X-Boundary.

Paper Structure

This paper contains 40 sections, 3 theorems, 11 equations, 16 figures, 8 tables, 1 algorithm.

Key Result

Proposition 1

(Proven in the appendix) If $\phi_\# \mu$ is $(n, \Delta)$-clusterable, then for all $m \leq n(2\Delta)^{-2}$, ...

Here, given a distribution $\mu$, $(n, \Delta)$-clusterable means that $\operatorname{supp}(\mu)$ lies in the union of $n$ balls of radius at most $\Delta$.
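The $(n, \Delta)$-clusterable condition can be made concrete with a small check on a finite sample. The sketch below is illustrative only and is not taken from the paper's code: the function name `covered_by_balls`, the use of Euclidean distance, and the idea of testing a finite set of points against hand-picked candidate centers (rather than the true support of $\mu$) are all assumptions. It verifies a certificate: if every sample point lies within $\Delta$ of one of $n$ given centers, the sample is covered by $n$ balls of radius $\Delta$.

```python
import math


def covered_by_balls(points, centers, delta):
    """Return True if every point is within `delta` of at least one center.

    Hypothetical helper: `centers` is a candidate set of n ball centers,
    and distances are Euclidean (an assumption, not stated in the source).
    """
    return all(
        min(math.dist(p, c) for c in centers) <= delta
        for p in points
    )


# Two tight groups of 2-D points (made-up illustrative data).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.1)]

# Two balls of radius 0.2 cover the sample, so it is (2, 0.2)-clusterable.
two_centers = [(0.0, 0.0), (5.0, 5.0)]
covers_with_two = covered_by_balls(points, two_centers, delta=0.2)

# A single ball of radius 0.2 centered at the origin does not cover it.
covers_with_one = covered_by_balls(points, [(0.0, 0.0)], delta=0.2)
```

Note that a failed check only rules out the given centers. Deciding whether any set of $n$ centers works is a harder covering problem that this sketch does not attempt.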

Figures (16)

  • Figure 1: Illustration of the representation distinction boundary and the trade-off between multi-turn defense performance and over-refusal of existing defense methods and X-Boundary.
  • Figure 2: Visualization of the representation distribution after implementing SFT, DPO, GA, and CB. "Harmful" and "boundary-safe" refer to the representations of harmful and boundary-safe queries along with their corresponding responses, respectively.
  • Figure 3: Illustration of representation manipulation in X-Boundary for a clear distinction boundary.
  • Figure 4: The training curves of X-Boundary and without X-Boundary on Llama-3-8B-Instruct and Qwen2.5-7B-Chat.
  • Figure 5: Visualization of the representation distribution of X-Boundary and without X-Boundary.
  • ...and 11 more figures

Theorems & Definitions (6)

  • Definition 1: Wasserstein-$1$ $k$-variance
  • Proposition 1
  • Proposition 1
  • proof
  • Definition 1: weed2017sharp
  • Proposition 2: Proven in weed2017sharp