X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability

Xiaoya Lu; Dongrui Liu; Yi Yu; Luxin Xu; Jing Shao

TL;DR

The paper targets multi-turn jailbreaks in LLMs and shows that existing defense methods trade usability for robustness. It introduces X-Boundary, which uses three losses to separate harmful representations from boundary-safe ones and to erase harmful representations while preserving safe responses. The method achieves state-of-the-art defense against multi-turn attacks while substantially reducing over-refusal and largely preserving general capabilities. These results are supported by a theoretical analysis based on optimal transport and by extensive experiments across multiple models and datasets. The work presents a practical, fine-grained defense that can complement traditional alignment techniques in real-world deployments.

Abstract

Despite the rapid development of safety alignment techniques for LLMs, defending against multi-turn jailbreaks remains challenging. In this paper, we conduct a comprehensive comparison, revealing that some existing defense methods can improve the robustness of LLMs against multi-turn jailbreaks but compromise usability, i.e., they reduce general capabilities or cause over-refusal. From the perspective of the mechanistic interpretability of LLMs, we discover that these methods fail to establish a boundary that exactly distinguishes safe and harmful feature representations. As a result, boundary-safe representations close to harmful representations are inevitably disrupted, leading to a decline in usability. To address this issue, we propose X-Boundary, which pushes harmful representations away from boundary-safe representations to obtain an exact distinction boundary. In this way, harmful representations can be precisely erased without disrupting safe ones. Experimental results show that X-Boundary achieves state-of-the-art defense performance against multi-turn jailbreaks, while reducing the over-refusal rate by about 20% and maintaining nearly complete general capability. Furthermore, we theoretically prove and empirically verify that X-Boundary can accelerate the convergence process during training. Please see our code at: https://github.com/AI45Lab/X-Boundary.
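The abstract names three goals: erase harmful representations, leave safe ones undisturbed, and push harmful representations away from boundary-safe ones. It does not give the loss formulas, so the following is only a minimal sketch under stated assumptions. The function name `three_term_losses`, the use of clipped cosine similarity for the erase and separate terms, and the use of L2 distance for the retain term are all illustrative choices made here, not the authors' confirmed formulation. Representations are plain lists of floats standing in for hidden states.

```python
import math

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def three_term_losses(harm_new, harm_ref, safe_new, safe_ref, boundary_new):
    """Hypothetical three-term objective (assumed forms, not the paper's).

    harm_new / harm_ref: harmful representations from the trained model and
        from a frozen reference copy.
    safe_new / safe_ref: safe representations from the same two models.
    boundary_new: boundary-safe representations from the trained model.
    """
    # Erase: penalize harmful representations that still align with the reference.
    erase = mean(max(cosine(n, r), 0.0) for n, r in zip(harm_new, harm_ref))
    # Retain: penalize drift of safe representations away from the reference.
    retain = mean(l2_distance(n, r) for n, r in zip(safe_new, safe_ref))
    # Separate: penalize alignment between harmful and boundary-safe representations.
    separate = mean(max(cosine(h, b), 0.0) for h in harm_new for b in boundary_new)
    return erase, retain, separate
```

In a training loop these three values would typically be combined as a weighted sum; the weights are a further assumption that the abstract does not specify.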
