Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models

Guangyu Yang, Jinghong Chen, Jingbiao Mei, Weizhe Lin, Bill Byrne

TL;DR

The paper tackles jailbreaking of LLMs by introducing Retrieval-Augmented Defense (RAD), a training-free detector that leverages a database of attack examples to infer underlying jailbreak strategies and compute $P(\text{harmful}|\mathbf{x})$ for decision making. RAD operates through a five-step retrieval pipeline (Retrieve, Rerank, Extract, Classify, Vote) and a generator threshold $\tau$ to block or forward prompts without altering the target model. Experiments on the StrongREJECT benchmark show RAD substantially reduces jailbreak attack effectiveness across multiple target models while preserving performance on benign queries and general QA tasks. An FRR-based operating-curve evaluation demonstrates a tunable safety-utility frontier, supporting practical deployment with controllable trade-offs.
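The last two stages described above (Vote, then the generator's threshold check) can be illustrated with a minimal sketch. The summary does not say how votes are aggregated or whether the comparison with $\tau$ is strict, so the fraction-of-votes estimate, the `>=` comparison, and the names `vote_probability` and `generator_decision` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List


def vote_probability(votes: List[bool]) -> float:
    """Hypothetical Vote step: estimate P(harmful | x) as the fraction of
    per-example classifications that flagged the prompt as harmful.
    (Assumption: the source does not specify the aggregation rule.)"""
    if not votes:
        # No retrieved evidence: treat the prompt as not flagged.
        return 0.0
    return sum(votes) / len(votes)


def generator_decision(p_harmful: float, tau: float) -> str:
    """Block the prompt when the estimate reaches the threshold tau,
    otherwise forward it to the target model.
    (Assumption: '>=' rather than '>' is an illustrative choice.)"""
    return "block" if p_harmful >= tau else "forward"


# Example: two of four classifications flag the prompt as harmful.
p = vote_probability([True, True, False, False])
decision = generator_decision(p, tau=0.4)
```

Under these assumptions, lowering `tau` blocks more prompts (safer, more false refusals) and raising it forwards more (more useful, less safe), which is the knob the TL;DR refers to.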

Abstract

Large Language Models (LLMs) remain vulnerable to jailbreak attacks, which attempt to elicit harmful responses from LLMs. The evolving nature and diversity of these attacks pose many challenges for defense systems, including (1) adaptation to counter emerging attack strategies without costly retraining, and (2) control of the trade-off between safety and utility. To address these challenges, we propose Retrieval-Augmented Defense (RAD), a novel framework for jailbreak detection that incorporates a database of known attack examples into Retrieval-Augmented Generation, which is used to infer the underlying, malicious user query and jailbreak strategy used to attack the system. RAD enables training-free updates for newly discovered jailbreak strategies and provides a mechanism to balance safety and utility. Experiments on StrongREJECT show that RAD substantially reduces the effectiveness of strong jailbreak attacks such as PAP and PAIR while maintaining low rejection rates for benign queries. We propose a novel evaluation scheme and show that RAD achieves a robust safety-utility trade-off across a range of operating points in a controllable manner.

Paper Structure

This paper contains 35 sections, 10 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Overview of the Retrieval-Augmented Defense (RAD) system. The pipeline consists of three components: (1) the attacker (top-left) generates an input prompt $\mathbf{x}$ from a user query $q$ using an attack strategy $\mathcal{S}$; (2) the defender (bottom) processes $\mathbf{x}$ through five steps (Retrieve, Rerank, Extract, Classify, and Vote) to compute the probability $P(\text{harmful}|\mathbf{x})$ based on known strategies retrieved from the database; and (3) the generator (top-right) compares $P(\text{harmful}|\mathbf{x})$ with a threshold $\tau$ to either block the input or pass it to the target model.
  • Figure 2: Operating curves of PAP defense score (100 - PAP attack score) versus False Refusal Rate (FRR) budget for three target models: Qwen-2.5-14B-Instruct, GPT-4o-mini, and Llama-3.1-8B-Instruct. Solid and dashed lines show the RAD-Base-14B and RAD-Sim-14B configurations respectively. Each prior defense system is shown as a distinct marker.
  • Figure 3: Operating curves of RAD-Sim-14B showing PAP defense scores (100 - PAP attack score) versus False Refusal Rate (FRR) budget on three target models. Circles indicate performance with PAP examples added to the RAD database, and triangles show performance without PAP examples. Across all FRR budgets and target models, RAD performs better with PAP examples in the database.
  • Figure 4: PAP defense scores (100 - PAP attack score) achieved by RAD-Sim-14B as DAN examples are added to the database (10K to 500K added DAN examples). Each curve corresponds to a different FRR budget. Defense performance against PAP remains stable as up to 500K additional examples are added. The Llama-3.1-8B-Instruct panel has no curves for the 1% and 2.5% FRR budgets because the lowest achievable FRR for that target model is larger than 2.5%.
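
The operating curves in Figures 2 to 4 plot a defense score (100 minus the attack score) against an FRR budget. One way to read such a curve is sketched below: for each budget, choose the lowest threshold $\tau$ whose false refusal rate on benign prompts stays within the budget. This is a hedged illustration only; the function names, the candidate-grid search, and the rule "block when score >= tau" are assumptions, and the summary does not describe how the authors select operating points.

```python
from typing import List, Optional


def false_refusal_rate(benign_scores: List[float], tau: float) -> float:
    """Fraction of benign prompts that would be blocked at threshold tau,
    assuming a prompt is blocked when its harmfulness score is >= tau."""
    return sum(s >= tau for s in benign_scores) / len(benign_scores)


def pick_threshold(
    benign_scores: List[float],
    frr_budget: float,
    candidates: List[float],
) -> Optional[float]:
    """Return the lowest candidate tau whose FRR is within the budget.
    FRR never increases as tau grows, so the first feasible candidate in
    ascending order is the strictest defense the budget allows.
    Returns None when no candidate satisfies the budget."""
    for tau in sorted(candidates):
        if false_refusal_rate(benign_scores, tau) <= frr_budget:
            return tau
    return None


def defense_score(attack_score: float) -> float:
    """Defense score as used in the figure captions: 100 - attack score."""
    return 100.0 - attack_score


# Illustrative (made-up) benign scores and candidate thresholds.
benign = [0.1, 0.2, 0.3, 0.9]
grid = [0.25, 0.5, 0.95]
tau_25 = pick_threshold(benign, frr_budget=0.25, candidates=grid)
```

In this sketch a `None` result means that no threshold on the grid meets the budget, which is the kind of situation in which a curve for that budget could not be drawn.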