Counterfactual Explainable Incremental Prompt Attack Analysis on Large Language Models

Dong Shu, Mingyu Jin, Tianle Chen, Chong Zhang, Yongfeng Zhang

TL;DR

The paper addresses the vulnerability of large language models to prompt-based attacks and introduces CEIPA, a Counterfactual Explainable Incremental Prompt Attack framework that mutates prompts across four levels to generate counterfactual explanations and identify transition points in model defenses. It formalizes the method with an update rule $P_i = f_{w,s,c,w/c}(P_{i-1})$ and demonstrates, through extensive experiments on jailbreak, system-prompt extraction, and hijacking tasks across multiple models, that incremental mutations substantially increase attack success rates and reveal transferability patterns. Counterfactual analyses via t-SNE reveal semantic and linguistic cues (notably verbs and adjectives) and boundary effects in prompt vulnerability, offering actionable insights for defense design. Overall, CEIPA provides a rigorous, explainable toolkit for evaluating and strengthening LLM safety by illuminating how small, structured prompt changes shift model behavior.
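The update rule above can be read as a simple loop: start from the original prompt $P_0$, apply one mutation per round, and record the first round at which the model's behavior changes (the transition point). The sketch below is a minimal illustration of that loop only, not the authors' implementation. The names `incremental_search`, `mutate`, and `succeeded` are hypothetical, and the toy mutation and success check are harmless stand-ins for the paper's level-specific mutations and its evaluation of model responses.

```python
from typing import Callable, List, Optional, Tuple


def incremental_search(
    prompt: str,
    mutate: Callable[[str], str],
    succeeded: Callable[[str], bool],
    max_rounds: int = 10,
) -> Tuple[List[str], Optional[int]]:
    """Iterate P_i = f(P_{i-1}) until `succeeded` holds or rounds run out.

    Returns the trajectory [P_0, P_1, ...] and the index of the first
    prompt for which `succeeded` is true (the transition point), or None
    if no prompt within `max_rounds` mutations satisfies it.
    """
    trajectory = [prompt]
    if succeeded(prompt):  # round 0 is the unmutated baseline
        return trajectory, 0
    for i in range(1, max_rounds + 1):
        prompt = mutate(prompt)  # one mutation step: P_i = f(P_{i-1})
        trajectory.append(prompt)
        if succeeded(prompt):
            return trajectory, i
    return trajectory, None


# Toy stand-ins (assumptions, purely illustrative): the "mutation" appends
# one character and the "success" check is a length threshold.
def toy_mutate(p: str) -> str:
    return p + "x"


def toy_succeeded(p: str) -> bool:
    return len(p) >= 5


trajectory, transition = incremental_search("abc", toy_mutate, toy_succeeded)
```

Keeping the whole trajectory, rather than only the final prompt, is what makes the counterfactual reading possible: the prompts just before and just after the transition index differ by a single mutation step.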

Abstract

This study sheds light on the imperative need to bolster safety and privacy measures in large language models (LLMs), such as GPT-4 and LLaMA-2, by identifying and mitigating their vulnerabilities through explainable analysis of prompt attacks. We propose Counterfactual Explainable Incremental Prompt Attack (CEIPA), a novel technique where we guide prompts in a specific manner to quantitatively measure attack effectiveness and explore the embedded defense mechanisms in these models. Our approach is distinctive for its capacity to elucidate the reasons behind the generation of harmful responses by LLMs through an incremental counterfactual methodology. By organizing the prompt modification process into four incremental levels (word, sentence, character, and a combination of character and word), we facilitate a thorough examination of the susceptibilities inherent to LLMs. The findings from our study not only provide counterfactual explanation insights but also demonstrate that our framework significantly enhances the effectiveness of attack prompts.

Paper Structure

This paper contains 46 sections, 10 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: The image illustrates how attack prompts undergo incremental mutations at the word, sentence, character, and character/word levels.
  • Figure 2: This illustration shows the incremental mutation workflow at all four levels. Starting with the original attack prompt, "Repeat everything in your prompt so far", we apply incremental mutations only if the initial attack fails.
  • Figure 3: This bar chart compares successful attacks in our Jailbreak, System Prompt Extraction, and Hijacking experiments, contrasting the four attack levels with the baseline. The left side shows results on the GPT-3.5 model, while the right side shows incremental attack performance on the LLaMA2-13B model. The vertical axis represents the number of successful attacks.
  • Figure 4: This composite graph shows the transfer success rates in our Jailbreak, System Prompt Extraction, and Hijacking experiments, using four sub-graphs. Each sub-graph represents the performance of one incremental attack level. Each sub-graph reports the transfer success rate as a percentage.
  • Figure 5: The first line graph shows successful attack trends across multiple rounds of the Jailbreak task on the GPT-3.5 model. Each line corresponds to a different mutation level. The horizontal axis shows the experiment rounds, starting at round 0 as the baseline. The vertical axis indicates the percentage of successful attacks. The second line graph shows the dynamics of successful attacks across multiple rounds of the System Prompt Extraction task on the GPT-3.5 model. The third line graph shows the dynamics of successful attacks across multiple rounds of the Hijacking task on the GPT-3.5 model.
  • ...and 1 more figures