STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models

Xunguang Wang, Wenxuan Wang, Zhenlan Ji, Zongjie Li, Pingchuan Ma, Daoyuan Wu, Shuai Wang

TL;DR

STShield addresses jailbreak vulnerabilities in large language models by integrating a real-time, single-token sentinel into the output stream. The model is trained with supervised fine-tuning on normal prompts and with adversarial training on embedding-space perturbations that simulate jailbreak attempts, so that it emits a binary safety indicator that governs output filtering. Empirical results show substantial reductions in jailbreak success across diverse attack types and models, with minimal latency and modest utility trade-offs, outperforming several baselines. The approach offers a practical defense for real-world LLM deployment because it removes the need for heavy external detectors while preserving user-facing performance.

Abstract

Large Language Models (LLMs) have become increasingly vulnerable to jailbreak attacks that circumvent their safety mechanisms. Existing defense methods are either vulnerable to adaptive attacks or require computationally expensive auxiliary models. We present STShield, a lightweight framework for real-time jailbreak detection. STShield introduces a novel single-token sentinel mechanism that appends a binary safety indicator to the model's response sequence, leveraging the LLM's own alignment capabilities for detection. Our framework combines supervised fine-tuning on normal prompts with adversarial training using embedding-space perturbations, achieving robust detection while preserving model utility. Extensive experiments demonstrate that STShield successfully defends against various jailbreak attacks while maintaining the model's performance on legitimate queries. Compared to existing approaches, STShield achieves superior defense performance with minimal computational overhead, making it a practical solution for real-world LLM deployment.
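
To make the sentinel mechanism concrete, the following is a minimal sketch of how a serving layer might use the appended indicator to filter output. It is an illustration under stated assumptions, not the authors' implementation: the sentinel strings `safe` and `harm`, the refusal message, and the function name `filter_response` are all hypothetical, and the response is modeled as a plain list of token strings rather than real model output.

```python
from typing import List

# Hypothetical sentinel strings; the source only states that the indicator is binary.
SAFE = "safe"
HARM = "harm"

# Hypothetical refusal text returned when the response is withheld.
REFUSAL = "Sorry, I cannot help with that request."


def filter_response(tokens: List[str]) -> str:
    """Return the response body if the trailing sentinel marks it safe.

    The last token is treated as the sentinel. Anything other than SAFE
    (including HARM, a missing sentinel, or an empty sequence) is refused,
    so the filter fails closed.
    """
    if not tokens:
        return REFUSAL
    *body, sentinel = tokens
    if sentinel == SAFE:
        return " ".join(body)
    return REFUSAL
```

Because the indicator is a single extra token at the end of the sequence, the check itself is one comparison. Refusing on any unexpected trailing token is a design choice of this sketch, not something the source specifies.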

Paper Structure

This paper contains 16 sections, 9 equations, 3 figures, and 6 tables.

Figures (3)

  • Figure 1: Case analysis of STShield on a jailbreak prompt from DAN.
  • Figure 2: Case analysis of STShield on a failed jailbreak prompt from DAN.
  • Figure 3: Case analysis of STShield on a normal prompt from AlpacaEval.