DROJ: A Prompt-Driven Attack against Large Language Models

Leyang Hu, Boran Wang

TL;DR

Directed Representation Optimization Jailbreak (DROJ) is a novel approach that optimizes jailbreak prompts at the embedding level, shifting the hidden representations of harmful queries in directions that are more likely to elicit affirmative responses from the model.

Abstract

Large Language Models (LLMs) have demonstrated exceptional capabilities across various natural language processing tasks. Because they are trained on internet-sourced datasets, LLMs can sometimes generate objectionable content, necessitating extensive alignment with human feedback to avoid such outputs. Despite massive alignment efforts, LLMs remain susceptible to adversarial jailbreak attacks, which are usually manipulated prompts designed to circumvent safety mechanisms and elicit harmful responses. Here, we introduce a novel approach, Directed Representation Optimization Jailbreak (DROJ), which optimizes jailbreak prompts at the embedding level to shift the hidden representations of harmful queries in directions that are more likely to elicit affirmative responses from the model. Our evaluations on the LLaMA-2-7b-chat model show that DROJ achieves a 100% keyword-based Attack Success Rate (ASR), effectively preventing direct refusals. However, the model occasionally produces repetitive and non-informative responses. To mitigate this, we introduce a helpfulness system prompt that enhances the utility of the model's responses. Our code is available at https://github.com/Leon-Leyang/LLM-Safeguard.
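The abstract reports a keyword-based ASR but does not define it here. A minimal sketch of how such a metric is commonly computed, assuming a response counts as a success when it contains none of a fixed list of refusal phrases, is given below. The marker list and the function names `is_refusal` and `keyword_asr` are hypothetical and are not taken from the authors' repository.

```python
# Hypothetical refusal markers; the keyword list used in the paper is not given here.
REFUSAL_MARKERS = (
    "i'm sorry",
    "i am sorry",
    "i cannot",
    "i can't",
    "i apologize",
    "as an ai",
)


def is_refusal(response: str, markers=REFUSAL_MARKERS) -> bool:
    """Return True if the response contains any refusal marker (case-insensitive)."""
    text = response.lower()
    return any(marker in text for marker in markers)


def keyword_asr(responses, markers=REFUSAL_MARKERS) -> float:
    """Fraction of responses that contain no refusal marker."""
    if not responses:
        raise ValueError("responses must be non-empty")
    non_refusals = sum(not is_refusal(r, markers) for r in responses)
    return non_refusals / len(responses)


if __name__ == "__main__":
    samples = [
        "I'm sorry, but I can't help with that.",
        "Sure. Sure. Sure. Sure.",  # not a refusal, but not informative either
        "I cannot assist with this request.",
        "Here is a general overview of the topic.",
    ]
    print(f"keyword-based ASR: {keyword_asr(samples):.2f}")
```

Under this assumed definition, the repetitive second sample counts as a success because it contains no refusal phrase. That is consistent with the abstract's observation that a high keyword-based ASR can coexist with non-informative responses.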

Paper Structure

This paper contains 20 sections, 8 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Prepending a safety prompt (right) to the input query can help the model refuse to respond to malicious prompts it might otherwise comply with.
  • Figure 2: Visualization of the hidden states of Mistral-7B-Instruct-v0.1 after applying 2-dimensional Principal Component Analysis (PCA). Left: Visualization of eight groups of queries (harmful/harmless + three kinds of safety prompts). The boundary (black dotted line) is fitted by logistic regression, showing that harmful and harmless queries can be distinguished without safety prompts. The red and blue arrows indicate the movement of harmful and harmless queries, respectively, when safety prompts are added; both groups move in a similar direction. Right: Color represents the empirical refusal rate, with the fitted boundary (gray line) separating accepted and refused queries. The gray arrow (normal vector of the fitted regression) indicates the direction of higher refusal probability. Figure reproduced from zheng2024prompt.
  • Figure 3: Visualization of the hidden representations of training (left) and testing (right) queries in LLaMA-2-7b-chat, projected onto the first two PCA components. The color of each symbol represents the empirical refusal rate of the query. The gray arrow denotes the direction of refusal. In both panels, the red arrow illustrates the trajectory of harmful queries (Harmful in training; AdvBench and MaliciousInstruct in testing), indicating their movement away from the refusal direction after the addition of the jailbreak prompt. Similarly, the blue arrow traces the trajectory of harmless queries (Harmless in both training and testing) when the jailbreak prompt is added. Both types of queries move away from the direction of refusal, and the empirical refusal rates suggest that harmful queries are less likely to be refused with the jailbreak prompt.
  • Figure 4: Comparison of the model's responses to queries prefixed with DROJ without the helpfulness prompt (left) and with the helpfulness prompt (right). In the absence of the helpfulness prompt, the model often produces non-refusal yet uninformative answers. In comparison, with the addition of the helpfulness prompt, the model outputs more useful information. The '<jailbreak_prompt_n>' placeholders represent DROJ jailbreak prompts, which are special tokens defined in the tokenizer to denote the locations of these prompts at the embedding level. The text in blue is the helpfulness system prompt.
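
The captions of Figures 2 and 3 describe an analysis pipeline: project hidden states to two dimensions with PCA, fit a logistic regression on refusal, take the normal vector of the fitted boundary as the refusal direction, and check how points move along it. A minimal NumPy sketch of that analysis is given below. It operates on arbitrary vectors rather than real model activations, and the function names, the plain gradient-descent fit, and its hyperparameters are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def pca_2d(hidden: np.ndarray) -> np.ndarray:
    """Project the rows of `hidden` onto their first two principal components."""
    centered = hidden - hidden.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


def fit_logistic(points: np.ndarray, labels: np.ndarray, lr: float = 0.1, steps: int = 2000):
    """Fit p(refuse | x) = sigmoid(w . x + b) by plain gradient descent."""
    w = np.zeros(points.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(points @ w + b)))
        err = probs - labels  # gradient of the log loss w.r.t. the logits
        w -= lr * (points.T @ err) / len(labels)
        b -= lr * err.mean()
    return w, b


def refusal_direction(w: np.ndarray) -> np.ndarray:
    """Unit normal of the decision boundary, pointing toward higher refusal probability."""
    return w / np.linalg.norm(w)


def mean_shift_along(direction: np.ndarray, before: np.ndarray, after: np.ndarray) -> float:
    """Average displacement along `direction`; negative means moving away from refusal."""
    return float(((after - before) @ direction).mean())


if __name__ == "__main__":
    # Synthetic stand-ins for hidden states: two Gaussian clusters in 16 dimensions.
    rng = np.random.default_rng(0)
    refused = rng.normal(size=(100, 16))
    refused[:, 0] += 2.0
    accepted = rng.normal(size=(100, 16))
    accepted[:, 0] -= 2.0
    points = pca_2d(np.vstack([refused, accepted]))
    labels = np.concatenate([np.ones(100), np.zeros(100)])
    w, b = fit_logistic(points, labels)
    direction = refusal_direction(w)
    print("refusal direction in PCA space:", direction)
```

Given 2-D projections of the same queries with and without a prefix, `mean_shift_along` summarizes in one number what the red and blue arrows in Figure 3 show visually: a negative value corresponds to movement away from the refusal direction.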