SoK: Evaluating Jailbreak Guardrails for Large Language Models

Xunguang Wang; Zhenlan Ji; Wenxuan Wang; Zongjie Li; Daoyuan Wu; Shuai Wang

SoK: Evaluating Jailbreak Guardrails for Large Language Models

Xunguang Wang, Zhenlan Ji, Wenxuan Wang, Zongjie Li, Daoyuan Wu, Shuai Wang

TL;DR

This SoK analyzes jailbreak guardrails for large language models, addressing fragmentation by introducing a six-dimension taxonomy and a Security-Efficiency-Utility (SEU) evaluation framework. It demonstrates how guardrails operate across intervention stages, paradigms, and granularity, and provides empirical insights from benchmarks on open- and closed-source LLMs. Key findings show that GuardReasoner variants offer strong defense but incur high latency and memory costs, while pre-processing guardrails tend to be more latency-efficient; session-level guards struggle with sophisticated multi-turn and cross-attack scenarios. The work highlights practical implications for deploying guardrails—emphasizing cross-attack robustness, adaptivity, and transparent decision-making—and provides a public codebase for reproducibility.

Abstract

Large Language Models (LLMs) have achieved remarkable progress, but their deployment has exposed critical vulnerabilities, particularly to jailbreak attacks that circumvent safety alignments. Guardrails--external defense mechanisms that monitor and control LLM interactions--have emerged as a promising solution. However, the current landscape of LLM guardrails is fragmented, lacking a unified taxonomy and comprehensive evaluation framework. In this Systematization of Knowledge (SoK) paper, we present the first holistic analysis of jailbreak guardrails for LLMs. We propose a novel, multi-dimensional taxonomy that categorizes guardrails along six key dimensions, and introduce a Security-Efficiency-Utility evaluation framework to assess their practical effectiveness. Through extensive analysis and experiments, we identify the strengths and limitations of existing guardrail approaches, provide insights into optimizing their defense mechanisms, and explore their universality across attack types. Our work offers a structured foundation for future research and development, aiming to guide the principled advancement and deployment of robust LLM guardrails. The code is available at https://github.com/xunguangwang/SoK4JailbreakGuardrails.

SoK: Evaluating Jailbreak Guardrails for Large Language Models

TL;DR

Abstract

SoK: Evaluating Jailbreak Guardrails for Large Language Models

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (8)