Target-driven Attack for Large Language Models

Chong Zhang, Mingyu Jin, Dong Shu, Taowen Wang, Dongfang Liu, Xiaobo Jin

TL;DR

This work proposes a target-driven black-box attack method that redefines the attack's goal as maximizing the KL divergence between the conditional probabilities of the clean text and the attack text. Based on this goal, it transforms the distance maximization problem into two convex optimization problems, one to solve for the attack text and one to estimate the covariance.

Abstract

Current large language models (LLMs) provide a strong foundation for large-scale user-oriented natural language tasks. Users can easily inject adversarial text or instructions through the user interface, which creates security challenges for LLMs, such as the model failing to give the correct answer. Although there is a large body of research on black-box attacks, most of these attacks use random and heuristic strategies. It is unclear how these strategies relate to the attack success rate, and therefore how they can be used to effectively improve model robustness. To solve this problem, we propose a target-driven black-box attack method that redefines the attack's goal as maximizing the KL divergence between the conditional probabilities of the clean text and the attack text. Based on this goal, we transform the distance maximization problem into two convex optimization problems, one to solve for the attack text and one to estimate the covariance. A projected gradient descent algorithm then solves for the vector corresponding to the attack text. Our approach includes two attack strategies: token manipulation and misinformation attack. Experimental results on multiple large language models and datasets demonstrate the effectiveness of our attack method.
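To make the KL objective concrete, the following is a minimal sketch of the closed-form KL divergence between two multivariate Gaussians. It assumes, for illustration only, that the conditional distributions for the clean and attack texts are modeled as Gaussians; the abstract mentions estimating a covariance but does not state this form. The function name `gaussian_kl` and the example means are hypothetical.

```python
import numpy as np


def gaussian_kl(mu_p, cov_p, mu_q, cov_q):
    """Return KL(N(mu_p, cov_p) || N(mu_q, cov_q)) in nats.

    Assumption: both distributions are multivariate Gaussians with
    symmetric positive-definite covariance matrices.
    """
    mu_p = np.asarray(mu_p, dtype=float)
    mu_q = np.asarray(mu_q, dtype=float)
    cov_p = np.asarray(cov_p, dtype=float)
    cov_q = np.asarray(cov_q, dtype=float)

    k = mu_p.shape[0]
    diff = mu_q - mu_p

    # Solve linear systems instead of inverting cov_q explicitly.
    trace_term = np.trace(np.linalg.solve(cov_q, cov_p))
    maha_term = diff @ np.linalg.solve(cov_q, diff)

    # slogdet is numerically safer than log(det(...)).
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)

    return 0.5 * (trace_term + maha_term - k + logdet_q - logdet_p)


# Hypothetical example: shared identity covariance, shifted mean.
kl = gaussian_kl([0.0, 0.0], np.eye(2), [1.0, 2.0], np.eye(2))
print(kl)  # 2.5
```

With a shared covariance, the divergence reduces to half the squared Mahalanobis distance between the means, so under this assumption a larger KL divergence corresponds to moving the attack-text mean further from the clean-text mean.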

Paper Structure

This paper contains 32 sections, 1 theorem, 48 equations, 7 figures, 3 tables, 2 algorithms.

Key Result

Theorem 1

For any two continuous probability distributions $p(y|x)$ and $p(y|x')$, we have where

Figures (7)

  • Figure 1: Overview of the target-driven black-box attack methods. The token manipulation method replaces the subject, predicate, and object in the clean text with synonyms to obtain multiple candidate attack texts. The misinformation attack method generates prompts through jailbreak templates and assistant models and inserts them into the clean text to obtain multiple candidate attack texts. The candidate text closest to the vector solved by our algorithm is used as the final attack text.
  • Figure 2: Assuming $z^* = (z_1,z_2)$ is the optimal solution to problem (\ref{eqn:opt-prob-fix-sigma}), as $z$ moves from $A$ through $z^*$ to $B$ on the ellipse, $\cos(z,x)$ keeps increasing, while $\|z\|_2$ first decreases and then increases.
  • Figure 3: Transfer Success Rate (TSR) heatmap on our method with token manipulation attack. The rows and columns represent the attack model and defense model, respectively.
  • Figure 4: Transfer Success Rate (TSR) heatmap on our method with misleading adversarial attack. The rows and columns represent the attack model and defense model, respectively.
  • Figure 5: When $z^* = (z_1,z_2)$ is the optimal solution to problem (\ref{eqn:joint-optimization}), the vector $\overrightarrow{zo}$ and the normal vector of the ellipse at point $z^*$ have the same direction.
  • ...and 2 more figures
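
The selection step described in the Figure 1 caption, picking the candidate attack text closest to the solved vector, can be sketched as below. This is only an illustration under stated assumptions: candidate texts are taken to be already embedded as fixed-length vectors, and "closest" is measured with cosine similarity, which the caption does not specify. The function `closest_candidate` and the example vectors are hypothetical.

```python
import numpy as np


def closest_candidate(target, candidates):
    """Return (index, similarity) of the candidate most similar to target.

    Assumptions: `target` is the solved attack vector, `candidates` is a
    2-D array with one non-zero embedding per row, and closeness is
    measured by cosine similarity.
    """
    target = np.asarray(target, dtype=float)
    candidates = np.asarray(candidates, dtype=float)

    # Cosine similarity between each candidate row and the target vector.
    norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(target)
    sims = (candidates @ target) / norms

    best = int(np.argmax(sims))
    return best, float(sims[best])


# Hypothetical embeddings for three candidate attack texts.
candidate_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
solved_vector = np.array([2.0, 2.0])

index, similarity = closest_candidate(solved_vector, candidate_vectors)
print(index)  # 2
```

Cosine similarity ignores vector length, so a candidate pointing in the same direction as the solved vector is preferred regardless of its norm. A Euclidean distance could be substituted if magnitude should also matter.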

Theorems & Definitions (2)

  • Theorem 1
  • proof