SPIN: Self-Supervised Prompt INjection

Leon Zhou, Junfeng Yang, Chengzhi Mao

TL;DR

Self-supervised Prompt INjection (SPIN) detects and reverses adversarial and jailbreak attacks on LLMs at inference time. It is compatible with existing alignment and adds an additional layer of safety.

Abstract

Large Language Models (LLMs) are increasingly used in a variety of important applications, yet their safety and reliability remain major concerns. Various adversarial and jailbreak attacks have been proposed to bypass safety alignment and cause the model to produce harmful responses. We introduce Self-supervised Prompt INjection (SPIN), which can detect and reverse these attacks on LLMs. Because our self-supervised prompt defense runs at inference time, it is compatible with existing alignment and adds an additional layer of safety. Our benchmarks demonstrate that our system can reduce the attack success rate by up to 87.9% while maintaining performance on benign user requests. In addition, we discuss the case of an adaptive attacker and show that our method remains resilient against attackers who are aware of our defense.

Paper Structure

This paper contains 19 sections, 7 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Self-Supervised Prompt Injection for LLM Defense. Large language models can be jailbroken by an adversarial prompt. In this paper, we show that we can detect and defend against jailbroken inputs by leveraging self-supervised language tasks. The red text shows the adversarial prompt injected to bypass the safety guardrails of LLMs. With SPIN, the model returns the safe response again.
  • Figure 2: Self-supervised Prompt Injections. Examples of inputs at each stage of the defense. The red text is the adversarial suffix. On the left side, all the malicious examples with the trigger attack are discovered or reversed. The repeat task fails to reproduce the adversarial suffix in red because the token sequence becomes too random. Even though the model repeats the sentence partially, the difference is large enough for the detection system to flag the input as malicious. The blue text under the 'Reversal' task represents our defensive prompt injections, which reverse malicious attacks but do not impact benign inputs. On the right side, we show the same benign question passing through the layers without issue. Our defense system is shown to defend against attacks that would otherwise break alignment, while maintaining performance on benign queries.
  • Figure 3: Self-Supervised Detection of Jailbreak Attacks. We detect jailbreak prompts through self-supervised natural language tasks. Here we show the 'Repeat' task and the 'Interject' task that form the detection layer. We can verify whether the prompt passes the tests because we know the answer beforehand: either repeating the entire input or stating that Paris is the capital of France. Since jailbreak prompts often inject unusual patterns into the text, the model will often underperform on our self-supervised language tasks compared to a benign user query. Our method uses this difference for detection.
  • Figure 4: Self-supervised loss and ROC Curve for benign and malicious inputs. We show the $\mathcal{L}_{\text{repeater}}$ loss on the left and the $\mathcal{L}_{\text{interject}}$ loss on the right. For the repeat task, we find the model can repeat benign queries from the TriviaQA task well, but fails to repeat consistently when the query is from AdvBench. For the interject task, the model is asked to answer 'What is the capital of France?' and the loss is for predicting whether the next token will be 'Paris'. As the malicious instruction prompts show, the loss increases only once the adversarial suffix is added. The divergent self-supervised loss values between malicious and benign inputs allow us to detect malicious queries. For the ROC curves, benign inputs are categorized as negatives and malicious inputs as positives. Losses for these graphs are calculated using Llama-2, with the best thresholds for the loss being 0.89 and 6.55 respectively.
  • Figure 5: Multi-Benchmark Attack Success Rate (ASR). We show the performance of our defense against multiple types of attacks and on benign inputs (lower is better, except for TriviaQA). Our defense is robust and dramatically lowers ASR across all attacks. Even in cases where alignment is effective, SPIN is shown to decrease the ASR further.
  • ...and 2 more figures
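
The threshold-based detection described in the captions of Figures 3 and 4 can be illustrated with a minimal sketch. It assumes the two self-supervised losses have already been computed by the model, and it uses the Llama-2 thresholds quoted in the Figure 4 caption (0.89 for the repeat task, 6.55 for the interject task). The function names, the negative log-likelihood form of the interject loss, the rule of flagging an input when either loss exceeds its threshold, and the toy loss values are all assumptions made for illustration, not the authors' implementation.

```python
import math

# Thresholds quoted in the Figure 4 caption (computed with Llama-2).
REPEAT_THRESHOLD = 0.89
INTERJECT_THRESHOLD = 6.55


def next_token_nll(prob_of_expected_token: float) -> float:
    """Assumed form of the interject loss: negative log-likelihood
    that the next token is the known answer (e.g. 'Paris')."""
    return -math.log(prob_of_expected_token)


def is_flagged(repeat_loss: float, interject_loss: float) -> bool:
    """Hypothetical combination rule: flag the input as malicious
    when either self-supervised loss exceeds its threshold."""
    return repeat_loss > REPEAT_THRESHOLD or interject_loss > INTERJECT_THRESHOLD


def roc_point(benign_losses, malicious_losses, threshold):
    """One point on an ROC curve, with benign inputs as negatives
    and malicious inputs as positives. Returns (FPR, TPR)."""
    fpr = sum(loss > threshold for loss in benign_losses) / len(benign_losses)
    tpr = sum(loss > threshold for loss in malicious_losses) / len(malicious_losses)
    return fpr, tpr


if __name__ == "__main__":
    # Toy loss values, invented purely for illustration.
    benign = [0.1, 0.2, 1.0, 0.3]
    malicious = [2.0, 0.5, 3.0, 4.0]
    print(is_flagged(repeat_loss=0.1, interject_loss=0.5))
    print(roc_point(benign, malicious, REPEAT_THRESHOLD))
```

Sweeping `threshold` over a range of values and collecting the resulting (FPR, TPR) pairs traces out a curve of the kind the Figure 4 caption describes; the thresholds reported there would correspond to the operating point the authors selected.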