Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models

Guobin Shen; Dongcheng Zhao; Yiting Dong; Xiang He; Yi Zeng


TL;DR

The paper addresses how to keep LLMs safe under jailbreak attacks without sacrificing utility. It introduces Jailbreak Antidote, which computes a safety direction via PCA on internal hidden states and, during inference, shifts a sparse subset of the last-token representation along that direction with a strength set by a scaling factor $\alpha$, modifying only about $5\%$ of dimensions. Experiments across nine models (2B–72B parameters) and ten jailbreak methods show a high defense success rate (DSR) with minimal impact on benign performance; some larger models reach $100\%$ DSR. The approach offers a practical, real-time mechanism for adjusting safety that requires no additional prompt tokens or retraining, with potential applicability to broader alignment challenges and real-world AI deployments.

Abstract

As large language models (LLMs) become integral to various applications, ensuring both their safety and utility is paramount. Jailbreak attacks, which manipulate LLMs into generating harmful content, pose significant challenges to this balance. Existing defenses, such as prompt engineering and safety fine-tuning, often introduce computational overhead, increase inference latency, and lack runtime flexibility. Moreover, overly restrictive safety measures can degrade model utility by causing refusals of benign queries. In this paper, we introduce Jailbreak Antidote, a method that enables real-time adjustment of LLM safety preferences by manipulating a sparse subset of the model's internal states during inference. By shifting the model's hidden representations along a safety direction with varying strengths, we achieve flexible control over the safety-utility balance without additional token overhead or inference delays. Our analysis reveals that safety-related information in LLMs is sparsely distributed; adjusting approximately 5% of the internal state is as effective as modifying the entire state. Extensive experiments on nine LLMs (ranging from 2 billion to 72 billion parameters), evaluated against ten jailbreak attack methods and compared with six defense strategies, validate the effectiveness and efficiency of our approach. By directly manipulating internal states during reasoning, Jailbreak Antidote offers a lightweight, scalable solution that enhances LLM safety while preserving utility, opening new possibilities for real-time safety mechanisms in widely-deployed AI systems.
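The core mechanics described above (a PCA-derived safety direction, a sparse mask over roughly 5% of dimensions, and a shift scaled by $\alpha$) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the synthetic "hidden states", the choice of the top-$k$ components by absolute magnitude as the mask, and the orientation of the direction from the benign mean toward the harmful mean are all assumptions made for illustration.

```python
import numpy as np

def safety_direction(benign, harmful):
    """First principal component of the pooled hidden states (rows = prompts)."""
    states = np.vstack([benign, harmful])
    centered = states - states.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    d = vt[0]
    # PCA fixes a direction only up to sign; orienting it from the benign
    # mean toward the harmful mean is an assumption of this sketch.
    if np.dot(harmful.mean(axis=0) - benign.mean(axis=0), d) < 0:
        d = -d
    return d / np.linalg.norm(d)

def top_k_mask(d, k_percent):
    """Binary mask keeping the k% of dimensions with largest |d| (assumed rule)."""
    k = max(1, int(round(d.size * k_percent / 100.0)))
    keep = np.argsort(np.abs(d))[-k:]
    mask = np.zeros_like(d)
    mask[keep] = 1.0
    return mask

def adjust(h, d, mask, alpha):
    """Shift hidden state h along the masked safety direction with strength alpha."""
    return h + alpha * mask * d

# Synthetic stand-ins for last-token hidden states (hypothetical data).
rng = np.random.default_rng(0)
dim = 200
offset = np.zeros(dim)
offset[:10] = 4.0  # harmful prompts differ from benign ones in 10 dimensions
benign = rng.normal(size=(64, dim))
harmful = rng.normal(size=(64, dim)) + offset

d = safety_direction(benign, harmful)
mask = top_k_mask(d, k_percent=5.0)
h = rng.normal(size=dim)
h_adj = adjust(h, d, mask, alpha=2.0)

print("dimensions modified:", int(np.count_nonzero(h_adj != h)), "of", dim)
print("projection before/after:", float(h @ d), float(h_adj @ d))
```

In this toy setup, raising `alpha` pushes the representation further along the direction while the mask limits the edit to 10 of 200 dimensions, which mirrors the safety-utility knob and the sparsity claim in the abstract. In a real model the shift would be applied to hidden states inside the forward pass, which this sketch does not attempt.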


Paper Structure (61 sections, 7 equations, 16 figures, 11 tables)


Introduction
Related Work
Jailbreak Attacks on LLMs
Defense Methods Against Jailbreak Attacks
Mechanistic Interpretability and Internal State Manipulation
Preliminaries
Jailbreak Attacks and Defenses
Internal Representations in LLMs
Method: Jailbreak Antidote
Identifying and Leveraging the Safety Direction
Sparsity in the Safety Representation
Adjusting Internal States During Inference
Balancing Safety and Utility
Experiments
Experimental Setup
...and 46 more sections

Figures (16)

Figure 1: Overview of Jailbreak Antidote. (a) Obtaining the safety direction $\mathbf{d}_{\text{safe}}$ using PCA on hidden states from benign and harmful prompts. (b) Adjusting the internal state $\mathbf{h}_{S'}$ of the adversarial prompt $S'$ by shifting it towards $\mathbf{d}_{\text{safe}}$ during inference. $S_0$ represents the original harmful prompt, and $S'$ represents the adversarial attack prompt. The example uses a past-tense attack. (c) Comparison on Llama-3.1-8B-it, with lines representing different $k\%$ values. Points along each line correspond to varying $\alpha$ values. The baseline point shows the performance of the original model without defense.
Figure 2: (a) t-SNE visualization of hidden states of benign prompts, harmful prompts, and adversarial prompts (PAIR and GCG) at different layers in Llama-3.1-8B-it. The safety direction $\mathbf{d}_{\text{safe}}^l$ is indicated by the arrows. In deeper layers, attack prompts are positioned between the benign and harmful clusters, indicating how attacks manipulate internal states. (b) Distribution of the components of $\mathbf{d}_{\text{safe}}^l$ at different layers, showing a long-tailed distribution that indicates sparsity in safety representations.
Figure 3: DSR heatmaps for different attack-defense combinations on (a) Phi-3-mini-it, (b) Qwen-1.5-7B-it, and (c) Llama-3-70B-it. Rows represent defense methods; columns represent attack methods.
Figure 4: Runtime per Query versus DSR for different defense methods across various models. Each point represents a defense method, with the x-axis showing the average runtime per query (seconds) and the y-axis showing the DSR.
Figure 5: Impact of the scaling factor $\alpha$ on DSR and Win Rate for different sparsity levels $k$. The left y-axis represents Win Rate (bars), and the right y-axis represents DSR (lines). (a) Qwen-2-7B-it, (b) Llama-3.1-8B-it. Different colors represent different $k\%$ values.
...and 11 more figures
