Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes

Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho

TL;DR

This work formalizes the refusal loss φ_θ(x) and its landscape to differentiate malicious jailbreak prompts from benign queries. It introduces Gradient Cuff, a two-step, training-free detector that first screens by refusal likelihood and then uses a zeroth-order gradient estimate to assess the gradient norm, enabling robust jailbreak detection. Across two aligned LLMs (LLaMA-2-7B-Chat and Vicuna-7B-V1.5) and six jailbreak methods (GCG, AutoDAN, PAIR, TAP, Base64, LRL), Gradient Cuff significantly reduces attack success while preserving performance on nonharmful queries, outperforming several baselines and remaining effective under adaptive attacks. The method is complementary to prompt-based defenses and offers a practical, inference-time solution with reasonable compute and utility trade-offs, enhancing safety for deployed LLM services.

Abstract

Large Language Models (LLMs) are becoming a prominent generative AI tool, where the user enters a query and the LLM generates an answer. To reduce harm and misuse, efforts have been made to align these LLMs to human values using advanced training techniques such as Reinforcement Learning from Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial jailbreak attempts aiming at subverting the embedded safety guardrails. To address this challenge, this paper defines and investigates the Refusal Loss of LLMs and then proposes a method called Gradient Cuff to detect jailbreak attempts. Gradient Cuff exploits the unique properties observed in the refusal loss landscape, including functional values and its smoothness, to design an effective two-step detection strategy. Experimental results on two aligned LLMs (LLaMA-2-7B-Chat and Vicuna-7B-V1.5) and six types of jailbreak attacks (GCG, AutoDAN, PAIR, TAP, Base64, and LRL) show that Gradient Cuff can significantly improve the LLM's rejection capability for malicious jailbreak queries, while maintaining the model's performance for benign user queries by adjusting the detection threshold.
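The two-step strategy described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `loss_fn` is a hypothetical stand-in for the refusal loss (in the paper it is tied to the LLM's responses and the query embedding), and the function names, the thresholds `0.5` and `1.0`, the step size `mu`, and the Gaussian-direction estimator are all choices made here for illustration.

```python
import math
import random


def zeroth_order_gradient(loss_fn, x, num_directions=10, mu=0.02, rng=None):
    """Estimate the gradient of loss_fn at x using only function evaluations."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    base = loss_fn(x)
    grad = [0.0] * len(x)
    for _ in range(num_directions):
        # Draw a random Gaussian direction and probe the loss a small step along it.
        u = [rng.gauss(0.0, 1.0) for _ in x]
        shifted = [xi + mu * ui for xi, ui in zip(x, u)]
        slope = (loss_fn(shifted) - base) / mu  # finite-difference directional slope
        for i, ui in enumerate(u):
            grad[i] += slope * ui / num_directions
    return grad


def gradient_cuff_flag(loss_fn, x, loss_threshold=0.5, norm_threshold=1.0, **zo_kwargs):
    """Return True if the query should be rejected under the two-step rule."""
    # Step 1 (assumed rule): a low refusal loss means the model is already likely to refuse.
    if loss_fn(x) < loss_threshold:
        return True
    # Step 2 (assumed rule): a large estimated gradient norm is treated as a jailbreak signal.
    grad = zeroth_order_gradient(loss_fn, x, **zo_kwargs)
    return math.sqrt(sum(g * g for g in grad)) > norm_threshold
```

With toy losses, `gradient_cuff_flag(lambda x: 0.1, [0.0, 0.0])` is rejected at step 1, while a flat loss such as `lambda x: 0.9` passes both steps because its estimated gradient is zero. A steep toy loss such as `lambda x: 0.6 + 10.0 * x[0]` passes step 1 but is flagged at step 2. In a real deployment both thresholds would need to be tuned against the benign refusal rate, as the abstract notes.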

Paper Structure (34 sections, 1 theorem, 24 equations, 9 figures, 13 tables, 4 algorithms)

This paper contains 34 sections, 1 theorem, 24 equations, 9 figures, 13 tables, 4 algorithms.

Key Result

Theorem 1

Let $\|\cdot\|$ denote a vector norm and assume $\nabla \phi_\theta(x)$ is $L$-Lipschitz continuous. With probability at least $1-\delta$, the approximation error of $\nabla \phi_\theta(x)$ satisfies $\|g_\theta(x) - \nabla \phi_\theta(x)\| \le \epsilon$ for some $\epsilon > 0$, where $g_\theta(x)$ denotes the zeroth-order gradient estimate, $\delta = \Omega(\frac{1}{N}+\frac{1}{P})$, and $\epsilon = \Omega(\frac{1}{\sqrt{P}})$. Here $l(t)=\Omega(s(t))$ means $s(t)$ is the infimum of $l(t)$.
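
To make the roles of $N$ and $P$ concrete, one common form of randomized zeroth-order gradient estimator is written out below. It is offered as an assumption for illustration and may differ from the paper's exact definition: $g_\theta(x)$ denotes the gradient estimate, $\hat{\phi}_\theta$ is taken to be an estimate of the refusal loss averaged over $N$ sampled responses, $u_1, \dots, u_P$ are random Gaussian directions, and $\mu > 0$ is a small smoothing step.

$$
g_\theta(x) = \frac{1}{P} \sum_{i=1}^{P} \frac{\hat{\phi}_\theta(x + \mu u_i) - \hat{\phi}_\theta(x)}{\mu}\, u_i, \qquad u_i \sim \mathcal{N}(0, I)
$$

Under this reading, the bound in the theorem says that averaging over more directions (larger $P$) tightens the error $\epsilon$, while both more sampled responses (larger $N$) and more directions reduce the failure probability $\delta$.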

Figures (9)

  • Figure 1: Overview of Gradient Cuff. (a) shows an example jailbreak prompt through a conversation between malicious actors and the Vicuna chatbot. (b) visualizes the refusal loss landscape for malicious and benign queries by interpolating along two random directions in the query embedding with coefficients $\alpha$ and $\beta$, following the visualization method cited in the paper. The refusal loss evaluates the probability that the LLM would not directly reject the input query; the loss value is computed using the paper's refusal loss equation, and the plotting details for (b) are given in the paper's appendix. (c) shows the running flow of Gradient Cuff (top), practical computing examples for refusal loss (bottom left), and the distributional difference of the gradient norm of refusal loss on benign and malicious queries (bottom right). (d) shows the performance of Gradient Cuff against 6 jailbreak attacks for Vicuna-7B-V1.5. See the paper's appendix for full results.
  • Figure 2: Performance evaluation on LLaMA-2-7B-Chat (a) and Vicuna-7B-V1.5 (b). The horizontal axis represents the refusal rate on benign user queries (FPR), and the vertical axis shows the average refusal rate across 6 malicious user query datasets (TPR). Error bars show the standard deviation of the refusal rate across these 6 jailbreak datasets. We also report the MMLU accuracy of Low-FPR methods to show their utility. Complete results can be found in the paper's appendix.
  • Figure 3: Performance comparison against adaptive jailbreak attacks.
  • Figure 4: Utility evaluation on MMLU (zero-shot) with and without Gradient Cuff.
  • Figure A1: Attack success rate of 6 jailbreak attacks evaluated on 2 aligned LLMs.
  • ...and 4 more figures

Theorems & Definitions (1)

  • Theorem 1