Table of Contents
Fetching ...

A methodological analysis of prompt perturbations and their effect on attack success rates

Tiago Machado, Maysa Malfiza Garcia de Macedo, Rogerio Abreu de Paula, Marcelo Carpinette Grave, Aminat Adebiyi, Luan Soares de Souza, Enrico Santarelli, Claudio Pinhanez

TL;DR

This study investigates how prompt perturbations interact with LLM alignment methods (SFT, RLHF, DPO) to influence attack success rates (ASR). Through AdvBench, seven open-source models, and nonparametric statistics, it demonstrates that ASR varies significantly with prompt structure and alignment method, undermining the notion of ASR invariance. The findings highlight the risk of relying on a single prompt configuration or benchmark for safety assessment and advocate for multidimensional, statistically robust evaluation frameworks. The work advances understanding of jailbreak robustness and informs safer design and evaluation of alignment strategies for open-source LLMs, with practical implications for deployment and benchmarking.

Abstract

This work aims to investigate how different Large Language Models (LLMs) alignment methods affect the models' responses to prompt attacks. We selected open source models based on the most common alignment methods, namely, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Reinforcement Learning with Human Feedback (RLHF). We conducted a systematic analysis using statistical methods to verify how sensitive the Attack Success Rate (ASR) is when we apply variations to prompts designed to elicit inappropriate content from LLMs. Our results show that even small prompt modifications can significantly change the Attack Success Rate (ASR) according to the statistical tests we run, making the models more or less susceptible to types of attack. Critically, our results demonstrate that running existing 'attack benchmarks' alone may not be sufficient to elicit all possible vulnerabilities of both models and alignment methods. This paper thus contributes to ongoing efforts on model attack evaluation by means of systematic and statistically-based analyses of the different alignment methods and how sensitive their ASR is to prompt variation.

A methodological analysis of prompt perturbations and their effect on attack success rates

TL;DR

This study investigates how prompt perturbations interact with LLM alignment methods (SFT, RLHF, DPO) to influence attack success rates (ASR). Through AdvBench, seven open-source models, and nonparametric statistics, it demonstrates that ASR varies significantly with prompt structure and alignment method, undermining the notion of ASR invariance. The findings highlight the risk of relying on a single prompt configuration or benchmark for safety assessment and advocate for multidimensional, statistically robust evaluation frameworks. The work advances understanding of jailbreak robustness and informs safer design and evaluation of alignment strategies for open-source LLMs, with practical implications for deployment and benchmarking.

Abstract

This work aims to investigate how different Large Language Models (LLMs) alignment methods affect the models' responses to prompt attacks. We selected open source models based on the most common alignment methods, namely, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Reinforcement Learning with Human Feedback (RLHF). We conducted a systematic analysis using statistical methods to verify how sensitive the Attack Success Rate (ASR) is when we apply variations to prompts designed to elicit inappropriate content from LLMs. Our results show that even small prompt modifications can significantly change the Attack Success Rate (ASR) according to the statistical tests we run, making the models more or less susceptible to types of attack. Critically, our results demonstrate that running existing 'attack benchmarks' alone may not be sufficient to elicit all possible vulnerabilities of both models and alignment methods. This paper thus contributes to ongoing efforts on model attack evaluation by means of systematic and statistically-based analyses of the different alignment methods and how sensitive their ASR is to prompt variation.

Paper Structure

This paper contains 49 sections, 5 figures, 28 tables.

Figures (5)

  • Figure 1: Overview of the proposed methodology to prove the two hypotheses. The flowchart for experiment 1 describes how to prove hypothesis 1. It starts with prompts from the adv-bench dataset, an alignment sentence can be attached to the prompt followed by an instruction template and an attack type. Note that each leaf is a prompt configuration, that will be sent to a LLM that will produce outputs. The outputs will be classified by a judge according how safe they are. Classifications can be statistically compared for testing if there are significant differences between them. Finally, the classification results in the ASR. The flowchart on the right side describes the experiment 2. It is analogue to experiment 1 but it is performed for two models aligned with different alignment methods.
  • Figure 2: Experiment 2 - Two models based on the same architecture, one aligned with DPO and other aligned with SFT, are compared. In the upper row, both models are attacked using prompts with the same configuration (i.e., using alignment prompt and using instruction template). The bars show the results of how many attack prompts successfully made the model produce unsafe content (scores 3-5 according the rubric used by a LLM-as-a-judge). While the DPO model refused to respond the attack prompt, the opposite happened for the SFT version in the four types of tested attacks (emotional, adversarial token, Base64, and BoN). In the lower row, the only difference in the prompt configuration is the absence of an instruction template. The bars shows that only this change is responsible for significant differences in the distribution of how successful a dataset of prompt attacks is.
  • Figure 3: Figure illustrating the methodology study modified to use judge models based on binary category range instead of ordinal ranges going from 1 to 5.
  • Figure 4: Experiment 2 for LLama-2-7b-chat-hf and Vicuna-7b-v1.5 - Two models based on the same architecture, one aligned with RLHF (LLama-2) and other aligned with SFT (Vicuna), are compared. In the upper row, both models are attacked using prompts with the same configuration (i.e., using alignment prompt and using instruction template). The bars show the results of how many attack prompts successfully made the model produce unsafe content (scores 3-5 according the rubric used by a LLM-as-a-judge). While the RLHF model refused to respond the attack prompts for almost the whole dataset, the opposite happened for the SFT version in the three types of tested attacks (emotional, adversarial token, and BoN), also showing differences due the presence of a template instruction. Such differences is also perceived in the LLama2 model. However, less often, when taking a look at the charts for the adversarial token and Base64 attacks. In the lower row, the only difference in the prompt configuration is the absence of an instruction template. The bars shows that only this change is responsible for significant differences in the distribution of how successful a dataset of prompt attacks is, making the RLHF and SFT versions more skewed to the safe side of the charts. Although the RLHF version is less variant, we still can see differences, and statistically they are significant. For example, the presence of a template shifts the score distribution to the unsafe side in all of the tested attacks.
  • Figure 5: Experiment 2 for Mistral-7b-anthropic and Mistral-7b-sft-constitutional-ai - Two models based on the same architecture (Mistral-7B-v0.1), one aligned with DPO (Mistral-Anth) and other aligned with SFT (Mistral-CAI), are compared. In the upper row, both models are attacked using prompts with the same configuration (i.e., using alignment prompt and using instruction template). The bars show the results of how many attack prompts successfully made the model produce unsafe content (scores 3-5 according the rubric used by a LLM-as-a-judge). As well as in the other examples, only the presence of an instruction template in the prompt was enough to provoke significant differences between alignment methods, what can be better noticeable in the upper row for the adversarial, Base64, and BoN attacks.