Single-pass Detection of Jailbreaking Input in Large Language Models
Leyla Naz Candogan, Yongtao Wu, Elias Abad Rocamora, Grigorios G. Chrysos, Volkan Cevher
TL;DR
This work tackles jailbreaking in large language models by proposing Single Pass Detection (SPD), a one-pass defense that predicts harmful outputs from the model's logits without requiring auxiliary models or multiple forward passes. SPD builds a compact logit-based feature matrix $\mathbf{H} \in \mathbb{R}^{r\times k}$ with $h_i = -\log(\sigma(\bm{l}_{i,k}))$ from the first $r$ token positions and top-$k$ logits, and trains an SVM with an RBF kernel to classify inputs as benign or attacked. The approach is evaluated on open-source models (Llama 2, Vicuna) and closed models (GPT-3.5/4) across multiple jailbreaking datasets (GCG, AutoDAN, PAIR, PAP), achieving high true-positive rates with very low false positives and substantial speedups over perturbation-based defenses. Results show SPD can detect attacks before producing harmful outputs, with near-perfect discrimination on some models and robustness even when full logit access is unavailable, indicating practical applicability for real-world LLM safety systems. The work suggests that logit-based, single-pass defenses offer a promising direction for efficient, scalable protection against adversarial prompts in modern LLMs.
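To make the feature construction concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that $\sigma$ denotes a softmax taken over the full vocabulary and that the top-$k$ entries are kept in descending order of probability; the function names `log_softmax` and `spd_features` and the toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def spd_features(logits_per_position, r, k):
    """Build an r x k matrix of negative log-probabilities.

    Assumption (not confirmed by the text above): the softmax is taken
    over the whole vocabulary, and each row keeps the k most probable
    tokens, most probable first.
    """
    rows = []
    for logits in logits_per_position[:r]:
        log_probs = log_softmax(logits)
        top_k = sorted(log_probs, reverse=True)[:k]
        rows.append([-lp for lp in top_k])
    return rows


# Toy logits: 3 token positions over a 4-token vocabulary (made-up values).
toy_logits = [
    [2.0, 1.0, 0.0, -1.0],
    [0.0, 0.0, 0.0, 0.0],
    [5.0, 0.0, 0.0, 0.0],
]

H = spd_features(toy_logits, r=2, k=3)

# Flatten H into one feature vector per input before classification.
flat_features = [value for row in H for value in row]
```

One flattened vector per prompt would then be passed to a classifier. The summary names an SVM with an RBF kernel; scikit-learn's `sklearn.svm.SVC(kernel="rbf")` is one readily available option, though the source does not say which library was used.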
Abstract
Defending aligned Large Language Models (LLMs) against jailbreaking attacks is a challenging problem: existing approaches require multiple requests or even queries to auxiliary LLMs, which makes them computationally heavy. Instead, we focus on detecting jailbreaking input in a single forward pass. Our method, called Single Pass Detection (SPD), leverages the information carried by the logits to predict whether the output sentence will be harmful, so the defense needs only one forward pass. SPD not only detects attacks effectively on open-source models but also minimizes the misclassification of harmless inputs. Furthermore, we show that SPD remains effective even without complete logit access in GPT-3.5 and GPT-4. We believe that our proposed method offers a promising approach to efficiently safeguard LLMs against adversarial attacks.
