An LLM can Fool Itself: A Prompt-Based Adversarial Attack

Xilie Xu; Keyi Kong; Ning Liu; Lizhen Cui; Di Wang; Jingfeng Zhang; Mohan Kankanhalli

TL;DR

This paper introduces PromptAttack, a prompt-based technique for auditing the adversarial robustness of LLMs. It casts a textual adversarial attack as a structured attack prompt with three components, original input (OI), attack objective (AO), and attack guidance (AG), which induces the victim LLM to output adversarial samples that fool itself. A fidelity filter discards samples that drift from the original meaning, while few-shot and ensemble strategies increase attack power. Experiments on GLUE with Llama2 variants and GPT-3.5 show that PromptAttack achieves higher attack success rates than AdvGLUE and AdvGLUE++, and that a perturbation as simple as an emoji can mislead GPT-3.5. The work highlights practical risks in safety-critical deployments and offers a scalable tool for robustness evaluation and for developing potential defenses against adversarial prompts.
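
The filter-then-ensemble idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `token_overlap` is a toy stand-in for a semantic similarity score (the figure captions refer to a BERTScore threshold), and the names `passes_fidelity`, `ensemble_attack`, `predict`, and the threshold `tau=0.5` are all hypothetical.

```python
from typing import Callable, Iterable, Optional


def token_overlap(a: str, b: str) -> float:
    """Toy Jaccard similarity over lowercase tokens (a stand-in, not BERTScore)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def passes_fidelity(
    original: str,
    candidate: str,
    similarity: Callable[[str, str], float] = token_overlap,
    tau: float = 0.5,  # hypothetical threshold, chosen only for this sketch
) -> bool:
    """Keep a candidate only if it stays close enough to the original."""
    return similarity(original, candidate) >= tau


def ensemble_attack(
    original: str,
    label: str,
    candidates: Iterable[str],
    predict: Callable[[str], str],
    similarity: Callable[[str, str], float] = token_overlap,
    tau: float = 0.5,
) -> Optional[str]:
    """Return the first faithful candidate that the model misclassifies.

    `candidates` is assumed to hold samples generated at different
    perturbation levels; the attack counts as successful if any one works.
    """
    for candidate in candidates:
        if not passes_fidelity(original, candidate, similarity, tau):
            continue  # rejected by the fidelity filter
        if predict(candidate) != label:
            return candidate
    return None


def toy_predict(text: str) -> str:
    """Hypothetical classifier that an appended emoji flips to 'negative'."""
    return "negative" if ":)" in text else "positive"


original = "the movie is great"
found = ensemble_attack(
    original,
    "positive",
    ["awful :)", "the movie is great :)"],
    toy_predict,
)
```

Here the first candidate flips the toy classifier but is rejected by the filter because it shares no tokens with the original; only the second candidate, which stays close to the original, is returned.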

Abstract

The wide-ranging applications of large language models (LLMs), especially in safety-critical domains, necessitate the proper evaluation of the LLM's adversarial robustness. This paper proposes an efficient tool to audit the LLM's adversarial robustness via a prompt-based adversarial attack (PromptAttack). PromptAttack converts adversarial textual attacks into an attack prompt that can cause the victim LLM to output an adversarial sample that fools itself. The attack prompt is composed of three important components: (1) original input (OI), including the original sample and its ground-truth label, (2) attack objective (AO), a task description that asks for a new sample that can fool the LLM itself without changing the semantic meaning, and (3) attack guidance (AG), containing the perturbation instructions that guide the LLM on how to complete the task by perturbing the original sample at the character, word, and sentence levels, respectively. In addition, we use a fidelity filter to ensure that the adversarial examples generated by PromptAttack preserve the original semantic meaning. Further, we enhance the attack power of PromptAttack by ensembling adversarial examples at different perturbation levels. Comprehensive empirical results using Llama2 and GPT-3.5 validate that PromptAttack consistently yields a much higher attack success rate compared to AdvGLUE and AdvGLUE++. Interesting findings include that a simple emoji can easily mislead GPT-3.5 to make wrong predictions.
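
The three-part structure of the attack prompt can be illustrated with a short Python sketch. It is a hedged example rather than the paper's actual prompt text: the `Sample` type, the `build_attack_prompt` function, and the instruction wording in `GUIDANCE` are assumptions made for illustration; only the OI/AO/AG decomposition and the three perturbation levels come from the abstract.

```python
from dataclasses import dataclass

# Hypothetical perturbation instructions, one per level named in the abstract.
# The wording is illustrative and is not taken from the paper.
GUIDANCE = {
    "character": "Change at most two letters in the sentence.",
    "word": "Replace at most two words with synonyms.",
    "sentence": "Paraphrase the sentence.",
}


@dataclass
class Sample:
    text: str   # original sample
    label: str  # ground-truth label


def build_attack_prompt(sample: Sample, level: str) -> str:
    """Assemble an attack prompt from OI, AO, and AG (illustrative wording)."""
    if level not in GUIDANCE:
        raise ValueError(f"unknown perturbation level: {level!r}")

    # (1) Original input: the sample together with its ground-truth label.
    oi = (
        f'The original sentence is "{sample.text}" '
        f"and it is classified as {sample.label}."
    )
    # (2) Attack objective: fool the classifier, keep the meaning.
    ao = (
        "Your task is to generate a new sentence that keeps the semantic "
        "meaning unchanged but would be classified differently."
    )
    # (3) Attack guidance: how to perturb at the chosen level.
    ag = f"You can finish the task by following this instruction: {GUIDANCE[level]}"

    return "\n".join([oi, ao, ag])


prompt = build_attack_prompt(Sample("the movie is great", "positive"), "word")
```

The resulting string would be sent to the victim LLM, whose response is then treated as the candidate adversarial sample.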

Paper Structure (54 sections, 3 equations, 5 figures, 19 tables)

Introduction
Related Work
Adversarial attacks.
Robustness evaluation of language models.
LLM's reliability issues.
Prompt-Based Adversarial Attack
Framework of PromptAttack
Original input (OI).
Attack objective (AO).
Attack guidance (AG).
Fidelity Filter
Enhancing PromptAttack
Few-shot strategy.
Ensemble strategy.
Experiments
...and 39 more sections

Figures (5)

Figure 1: Our proposed prompt-based adversarial attack (PromptAttack) against LLMs is composed of three key components: original input, attack objective, and attack guidance.
Figure 2: Our proposed PromptAttack generates an adversarial sample by adding an emoji ":)", which can successfully fool GPT-3.5.
Figure 3: The ASR w.r.t. BERTScore threshold $\tau_2$ evaluated in the SST-2, MNLI-m, and QNLI tasks using GPT-3.5. Extra results evaluated in the MNLI-m, QQP, and RTE tasks are in Figure 4.
Figure 4: The ASR w.r.t. BERTScore threshold $\tau_2$ evaluated in the MNLI-m, QQP, and RTE tasks using GPT-3.5.
Figure 6: Attack transferability of PromptAttack from GPT-3.5 to Llama2-7B and Llama2-13B.
