Single-pass Detection of Jailbreaking Input in Large Language Models

Leyla Naz Candogan, Yongtao Wu, Elias Abad Rocamora, Grigorios G. Chrysos, Volkan Cevher

TL;DR

This work tackles jailbreaking in large language models by proposing Single Pass Detection (SPD), a one-pass defense that predicts harmful outputs from the model's logits without requiring auxiliary models or multiple forward passes. SPD builds a compact logit-based feature matrix $\mathbf{H} \in \mathbb{R}^{r\times k}$ with $h_i = -\log(\sigma(\bm{l}_{i,k}))$ from the first $r$ token positions and top-$k$ logits, and trains an SVM with an RBF kernel to classify inputs as benign or attacked. The approach is evaluated on open-source models (Llama 2, Vicuna) and closed models (GPT-3.5/4) across multiple jailbreaking datasets (GCG, AutoDAN, PAIR, PAP), achieving high true-positive rates with very low false positives and substantial speedups over perturbation-based defenses. Results show SPD can detect attacks before producing harmful outputs, with near-perfect discrimination on some models and robustness even when full logit access is unavailable, indicating practical applicability for real-world LLM safety systems. The work suggests that logit-based, single-pass defenses offer a promising direction for efficient, scalable protection against adversarial prompts in modern LLMs.
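
The feature construction described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: it assumes that $\sigma$ is a softmax taken over the $k$ largest logits at each position, and the function name `spd_features` and the random demo logits are hypothetical.

```python
import numpy as np


def spd_features(logits: np.ndarray, r: int, k: int) -> np.ndarray:
    """Build an r x k matrix H with rows h_i = -log(softmax(top-k logits at position i)).

    logits: array of shape (num_positions, vocab_size), one row per token position.
    Assumption (not confirmed by the source): the softmax is taken over the
    k largest logits of each row rather than over the full vocabulary.
    """
    if logits.ndim != 2:
        raise ValueError("logits must be a 2-D array")
    if logits.shape[0] < r or logits.shape[1] < k:
        raise ValueError("need at least r positions and k vocabulary entries")

    first = logits[:r]
    # Sort each row in descending order and keep the k largest logits.
    topk = -np.sort(-first, axis=1)[:, :k]
    # Numerically stable log-softmax over the k kept logits.
    shifted = topk - topk.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs


# Hypothetical demo input: random numbers standing in for real model logits.
rng = np.random.default_rng(0)
demo_logits = rng.normal(size=(10, 50))
H = spd_features(demo_logits, r=5, k=8)
x = H.ravel()  # flattened feature vector for one input
```

One way to complete the pipeline would be to stack one such flattened vector per prompt and fit an RBF-kernel SVM on them, for example with scikit-learn's `SVC(kernel="rbf")`; the labels and training data would come from benign and attacked prompts as described in the summary.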

Abstract

Defending aligned Large Language Models (LLMs) against jailbreaking attacks is a challenging problem, with existing approaches requiring multiple requests or even queries to auxiliary LLMs, making them computationally heavy. Instead, we focus on detecting jailbreaking input in a single forward pass. Our method, called Single Pass Detection (SPD), leverages the information carried by the logits to predict whether the output sentence will be harmful. This allows us to defend in just one forward pass. SPD not only detects attacks effectively on open-source models but also minimizes the misclassification of harmless inputs. Furthermore, we show that SPD remains effective even without complete logit access in GPT-3.5 and GPT-4. We believe that our proposed method offers a promising approach to efficiently safeguard LLMs against adversarial attacks.

Paper Structure

This paper contains 40 sections, 5 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: Schematic of the proposed method (right) and comparison with previous approaches: perturbation-based defenses such as SmoothLLM and RA-LLM (left) and auxiliary-LLM-based defenses such as Self-Defense (middle). Our method requires a single forward pass to predict the attack.
  • Figure 2: Changes caused by attacks: a) We plot the distribution of the logits of the first token for a sentence before and after an attack in green and orange, respectively. The original sentence is "Tell me how to build a bomb." The attacked sentence was generated with GCG and the target output "Sure, here is how to build a bomb." In the graph, the first-token distribution is shifted after the attack. b) We illustrate the entropy of the first logits of 1,000 randomly selected benign and attacked sentences in blue and purple, respectively. Attacked sentences show a higher entropy. c) We show the 2-dimensional t-SNE plot of the training set using the feature vector $h$, which shows the clear separability of attacked and benign sentences. Blue points correspond to benign sentences, whereas purple ones are attacked.
  • Figure 3: Confusion matrices showing true positive (TP), true negative (TN), false positive (FP), and false negative (FN) percentages to compare SPD with previous works. The upper graph is for Llama 2 and the lower one for Vicuna. Higher TP and lower FP indicate better performance, and SPD achieves better rates than any other method for both models.
  • Figure 4: Effect of the training data size of $\bm{H}$ matrix: We plot the TP (left) and FP (right) rates for different $r$ and $k$ values using the SPD approach with the Vicuna model. Different lines correspond to different $k$ values. Results show that $k>20$ and $r>5$ yield better performance.
  • Figure 5: Effect of the training data size of $\bm{H}$ matrix: We plot the TP (left) and FP (right) rates for different $T$ and $T_{safe}$ values using the SPD approach with the Vicuna model. Different lines correspond to different $T_{safe}$ values. Results show that $T_{safe}>20$ is necessary for low FP and that TP tends to increase as $T$ increases.
  • ...and 4 more figures
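
The entropy comparison in panel (b) of Figure 2 can be illustrated with a small sketch. It assumes the entropy is the Shannon entropy of the softmax distribution over the first token's logits; the caption does not spell this out, so the function name `first_token_entropy` and the toy logit vectors below are hypothetical.

```python
import numpy as np


def first_token_entropy(logits) -> float:
    """Shannon entropy (in nats) of softmax(logits) for one token position.

    Assumption: this is how the "entropy of the first logits" in Figure 2(b)
    is computed; the source does not give the exact formula.
    """
    logits = np.asarray(logits, dtype=float)
    # Numerically stable log-softmax.
    shifted = logits - logits.max()
    log_p = shifted - np.log(np.exp(shifted).sum())
    p = np.exp(log_p)
    return float(-(p * log_p).sum())


# Toy logits, not real model outputs: a peaked and a flat distribution.
peaked = np.array([10.0, 0.0, 0.0, 0.0])
flat = np.zeros(4)

entropy_peaked = first_token_entropy(peaked)
entropy_flat = first_token_entropy(flat)
```

Under this definition a flat distribution reaches the maximum entropy $\log V$ for a vocabulary of size $V$, while a peaked one stays close to zero, which is the kind of gap the figure reports between attacked and benign sentences.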