AttentionDefense: Leveraging System Prompt Attention for Explainable Defense Against Novel Jailbreaks

Charlotte Siska, Anush Sankaran

TL;DR

AttentionDefense introduces an explainable, compute-efficient jailbreak detector that leverages system-prompt attention weights from small language models. By extracting last-layer attention across system-prompt tokens and heads and feeding a classifier, it detects adversarial prompts with high precision and competitive F1, matching or surpassing embedding-based and some LLM detectors while requiring far fewer parameters. The method generalizes well to novel jailbreaks generated by an ALMAS closed-loop framework and is robust to known attacks, offering a practical defense with scalable deployment potential. Overall, the approach demonstrates that system-prompt attention provides a reliable, interpretable signal for jailbreak detection beyond semantic embeddings, enabling cheaper yet effective defenses in real-world settings.
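The feature-extraction step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `system_prompt_attention_features`, the token span arguments, and the choice of the final query position (the one that produces the first generated token) are assumptions made for illustration, and the attention weights here are synthetic rather than taken from a real SLM.

```python
import numpy as np


def system_prompt_attention_features(attn, sys_start, sys_end, query_pos=-1):
    """Flatten last-layer attention over system-prompt tokens into a feature vector.

    attn: array of shape (num_heads, seq_len, seq_len) holding the last layer's
          attention weights (rows are query positions, columns are key positions).
    sys_start, sys_end: hypothetical token span of the system prompt (end exclusive).
    query_pos: query position to read from; -1 is the final input position,
          which is an assumption about how the attention is "projected".
    """
    attn = np.asarray(attn, dtype=float)
    if attn.ndim != 3 or attn.shape[1] != attn.shape[2]:
        raise ValueError("expected attention of shape (heads, seq_len, seq_len)")
    # One row per head, one column per system-prompt token.
    block = attn[:, query_pos, sys_start:sys_end]
    # The flattened vector is what a downstream classifier would consume.
    return block.reshape(-1)


# Synthetic causal attention standing in for a real model's last layer.
rng = np.random.default_rng(0)
num_heads, seq_len = 4, 10
scores = rng.normal(size=(num_heads, seq_len, seq_len))
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
scores = np.where(causal_mask, scores, -np.inf)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Assume the first 6 tokens are the system prompt and the rest is the user prompt.
features = system_prompt_attention_features(weights, sys_start=0, sys_end=6)
print(features.shape)
```

With a real model, `weights` would be replaced by the last-layer attention returned by the SLM for the concatenated system and user prompt, and `features` would be passed to a classifier such as the RandomForest mentioned in the Figure 5 caption.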

Abstract

In the past few years, Language Models (LMs) have shown human-level capabilities in several domains. Despite their practical applications and growing user adoption, they are susceptible to jailbreaks, in which malicious input exploits the LM's weaknesses and causes it to deviate from its intended behavior. Current defensive strategies either classify the input prompt as adversarial or prevent LMs from generating harmful outputs. However, it is challenging to explain the reason behind the malicious nature of a jailbreak, which results in a wide variety of closed-box approaches. In this research, we propose and demonstrate that system-prompt attention from Small Language Models (SLMs) can be used to characterize adversarial prompts, providing a novel, explainable, and cheaper defense approach called AttentionDefense. Our research suggests that the attention mechanism is an integral component in understanding and explaining how LMs respond to malicious input that is not captured in the semantic meaning of text embeddings. The proposed AttentionDefense is evaluated against existing jailbreak benchmark datasets. Ablation studies show that SLM-based AttentionDefense has equivalent or better jailbreak detection performance compared to text embedding-based classifiers and GPT-4 zero-shot detectors. To further validate the efficacy of the proposed approach, we generate a dataset of novel jailbreak variants of the existing benchmark dataset using a closed-loop LLM-based multi-agent system. We demonstrate that the proposed AttentionDefense approach performs robustly on this novel jailbreak dataset while existing approaches suffer in performance. Additionally, for practical purposes AttentionDefense is an ideal solution, as it has the computation requirements of a small LM but the performance of an LLM detector.

Paper Structure

This paper contains 13 sections, 3 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Intensity of attention weights across system-prompt tokens during LM inference. The harmful payload (highlighted in yellow) and the jailbreak mechanism (highlighted in green) shift the system-prompt attention differently than a benign prompt does.
  • Figure 2: End-to-end pipeline for (1) training explainable jailbreak detection using AttentionDefense, and (2) evaluating and protecting LMs against known and unknown jailbreak attacks.
  • Figure 3: ALMAS framework (Attack using LLM based Multi-Agent Systems) to generate novel jailbreak variants of existing benchmark datasets.
  • Figure 4: Jailbreak prompt example (top) and benign prompt example (bottom), along with the system-prompt attention projected using the first generation token. The system prompt contains payload 1 (see Table \ref{tab:jailbreak_payload_instructions}) and mechanism 2 (see Table \ref{tab:jailbreak_mechanism_instructions}).
  • Figure 5: F1 scores for the AttentionDefense RandomForest system-prompt experiments on ALMAS novel jailbreaks. In Figure \ref{subfig:systemprompt_rf_f1_main}, the F1 score displayed for each system prompt is the maximum across a range of possible thresholds. In Figure \ref{subfig:systemprompt_rf_f1_highprecision}, the F1 scores are those achieved with precision > 0.99. The ith payload and jth mechanism used in the system prompt are listed in Tables \ref{tab:jailbreak_payload_instructions} and \ref{tab:jailbreak_mechanism_instructions}. In the heatmap, each cell is the F1 of an AttentionDefense model whose system prompt contains the column i payload and the row j mechanism. If column i or row j is None, the payload or mechanism is absent from the system prompt.
  • ...and 1 more figure
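The two F1 summaries described in the Figure 5 caption (the maximum F1 across thresholds, and the F1 under a precision > 0.99 constraint) can be sketched as follows. This is only an illustration under stated assumptions: the helper names `precision_recall_f1` and `best_f1`, the use of each distinct score as a candidate threshold, and the toy labels and scores are all hypothetical and are not taken from the paper.

```python
def precision_recall_f1(y_true, scores, threshold):
    """Compute precision, recall, and F1 when predicting positive for score >= threshold."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def best_f1(y_true, scores, min_precision=None):
    """Return the highest F1 over candidate thresholds.

    If min_precision is given, only thresholds whose precision is strictly
    greater than min_precision are considered (e.g. 0.99 for precision > 0.99).
    """
    best = 0.0
    for threshold in sorted(set(scores)):
        precision, _, f1 = precision_recall_f1(y_true, scores, threshold)
        if min_precision is not None and not precision > min_precision:
            continue
        best = max(best, f1)
    return best


# Toy data: 1 = jailbreak, 0 = benign; scores are made-up classifier outputs.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.1]

max_f1 = best_f1(y_true, scores)
high_precision_f1 = best_f1(y_true, scores, min_precision=0.99)
print(round(max_f1, 3), round(high_precision_f1, 3))
```

On this toy data the precision constraint forces a stricter threshold, so the high-precision F1 is lower than the unconstrained maximum; each heatmap cell in the two subfigures would hold one such value per system prompt.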