Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts?

Sravanti Addepalli, Yerram Varun, Arun Suggala, Karthikeyan Shanmugam, Prateek Jain

TL;DR

This paper probes the robustness of safety-aligned LLMs to natural prompts that are semantically related to toxic seed prompts. It introduces Response Guided Question Augmentation (ReG-QA), a two-stage pipeline that uses an unaligned LLM to generate toxic answers from a seed question and a safety-aligned LLM to produce diverse, natural questions that could elicit those answers, without optimizing for jailbreaks. Evaluated on JailbreakBench across models like GPT-4 and GPT-3.5, ReG-QA achieves high attack success rates ($82\%$ and $93\%$, respectively) and outperforms paraphrase-based baselines, while remaining robust to defenses such as Smooth-LLM and Synonym Substitution. The results reveal significant generalization gaps in current safety training and motivate the development of stronger defenses and evaluation protocols for safety generalization in LLMs.

Abstract

Large Language Models (LLMs) are known to be susceptible to crafted adversarial attacks or jailbreaks that lead to the generation of objectionable content despite being aligned to human preferences using safety fine-tuning methods. While the large dimensionality of the input token space makes it inevitable that adversarial prompts capable of jailbreaking these models can be found, we aim to evaluate whether safety fine-tuned LLMs are safe against natural prompts which are semantically related to toxic seed prompts that elicit safe responses after alignment. We surprisingly find that popular aligned LLMs such as GPT-4 can be compromised using naive prompts that are NOT even crafted with an objective of jailbreaking the model. Furthermore, we empirically show that given a seed prompt that elicits a toxic response from an unaligned model, one can systematically generate several semantically related natural prompts that can jailbreak aligned LLMs. Towards this, we propose a method of Response Guided Question Augmentation (ReG-QA) to evaluate the generalization of safety-aligned LLMs to natural prompts. The method first generates several toxic answers given a seed question using an unaligned LLM (Q to A), and then leverages an LLM to generate questions that are likely to produce these answers (A to Q). We interestingly find that safety fine-tuned LLMs such as GPT-4o are vulnerable to producing natural jailbreak questions from unsafe content (without denial) and can thus be used for the latter (A to Q) step. We obtain attack success rates that are comparable to or better than leading adversarial attack methods on the JailbreakBench leaderboard, while being significantly more stable against defenses such as Smooth-LLM and Synonym Substitution, which are effective against all existing attacks on the leaderboard.
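
The two-step data flow described in the abstract (Q to A, then A to Q) can be sketched structurally as a simple fan-out. This is only an illustrative sketch, not the authors' implementation: the function name `reg_qa_augment`, the callables `answer_model` and `question_model`, the `num_answers` parameter, and the de-duplication step are all assumptions introduced here, and no real model or prompt is involved.

```python
from typing import Callable, List


def reg_qa_augment(
    seed_question: str,
    answer_model: Callable[[str], str],
    question_model: Callable[[str], List[str]],
    num_answers: int = 3,
) -> List[str]:
    """Collect augmented questions for one seed question.

    All names here are hypothetical placeholders:
      answer_model   maps a question to one sampled answer (Q to A).
      question_model maps an answer to candidate questions (A to Q).
    """
    augmented: List[str] = []
    seen = set()
    for _ in range(num_answers):
        # Q to A: sample one answer for the seed question.
        answer = answer_model(seed_question)
        # A to Q: ask for questions that this answer would respond to.
        for question in question_model(answer):
            # De-duplication is an assumption of this sketch.
            if question not in seen:
                seen.add(question)
                augmented.append(question)
    return augmented
```

Passing the two models in as plain callables keeps the sketch independent of any particular LLM API; the resulting question set is what would then be scored against a target model in an evaluation harness.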

Paper Structure

This paper contains 19 sections, 6 figures, 7 tables, 1 algorithm.

Figures (6)

  • Figure 1: Schematic diagram of data distributions highlighting different types of jailbreak questions: Let R4 denote the space of all text which may or may not have semantic meaning, R3 denote a subset of R4 containing text with semantic meaning, R2 denote the pre-training data distribution, and R0 denote the fine-tuning data distribution, with R1 being the region close to the fine-tuning data distribution. Note that R0 may not always be a subset of R2. R0 is considered to be the region where the LLM is trained to give safe (denial) responses as a result of SFT/RLHF based safety fine-tuning. We depict different methods of modifying a toxic seed question that results in a safe denial response (denoted by a green cross in R0), into a jailbreak that results in a toxic response (denoted by red cross). While prompts close to R0 have strict constraints on naturalness of meaning and content, and are thus considered to be safer by virtue of generalization of safety training, prompts closer to R4 can be constructed to overcome the underlying safety mechanism.
  • Figure 2: Diagram describing various steps of our method Response Guided Question Augmentation (ReG-QA). From a seed question, we use an unaligned LLM to generate multiple answers, each of which is passed to another LLM to generate questions that would give that answer.
  • Figure 3: Attack Success Rate of the proposed algorithm across variation in a) the number of question augmentations per seed question, and b) the similarity of the generated question with respect to the seed.
  • Figure 4: Plot showing the average number of generated natural jailbreak prompts per seed prompt per 100 queries for the GPT-4-0125-preview model over multiple categories. On average, the proposed approach of Response-Guided Question Augmentation (ReG-QA) produces a significantly higher number of jailbreaks than Paraphrasing Based Question Augmentation (Para-QA).
  • Figure 5: Plot showcasing diversity vs. relevance of the generated question augmentations with respect to the seed question. We calculate relevance using the cosine similarity between the Gecko embeddings corresponding to the seed question and the augmented question. The diversity is calculated as the volume enclosed by the normalized embeddings on the sphere. We present this for two cases: (a) Full question augmentation set, (b) Questions that were successful in jailbreaking GPT-3.5.
  • ...and 1 more figures
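
The relevance measure named in the Figure 5 caption, cosine similarity between the seed-question embedding and each augmented-question embedding, can be sketched as follows. This is a minimal illustration assuming the embeddings have already been computed and are available as plain lists of floats; the Gecko embedding model itself is not called here, and the function names `cosine_similarity` and `relevance_scores` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def relevance_scores(
    seed_embedding: Sequence[float],
    augmented_embeddings: Sequence[Sequence[float]],
) -> List[float]:
    """Relevance of each augmented question with respect to the seed question."""
    return [cosine_similarity(seed_embedding, e) for e in augmented_embeddings]
```

A score near 1 means the augmented question points in almost the same embedding direction as the seed, while a score near 0 means the two are close to orthogonal. The diversity measure from the caption (volume enclosed by the normalized embeddings) is not sketched here because the caption does not specify how that volume is computed.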