Towards the Worst-case Robustness of Large Language Models

Huanran Chen, Yinpeng Dong, Zeming Wei, Hang Su, Jun Zhu

TL;DR

The paper studies worst-case robustness in large language models. It first develops a strong adaptive white-box attack, I$^2$-GCG, that enforces token-consistent optimization and inference, and uses it to show that most deterministic defenses remain vulnerable. It then introduces a theoretical framework that derives tight lower bounds for randomized smoothing via Fractional Knapsack (bounded $f$) and 0-1 Knapsack (binary $f$) solvers, enabling black-box certified robustness for stochastic defenses. Through two case studies on Absorbing and Uniform kernels, the authors prove that Uniform kernels yield stronger certified robustness and that, in the limit of large vocabulary, their bounds converge to the Absorbing bounds, which informs kernel choice. Empirically, DiffTextPure defends well against optimization-based attacks in black-box settings, and the certification results give nontrivial average radii on AdvBench (e.g., $2.02$ for $\ell_0$ perturbations and $6.41$ for suffix attacks). These results bear on the robust deployment of LLMs and point to the need for improved certification methods.

Abstract

Recent studies have revealed the vulnerability of large language models to adversarial attacks, where adversaries craft specific input sequences to induce harmful, violent, private, or incorrect outputs. In this work, we study their worst-case robustness, i.e., whether an adversarial example exists that leads to such undesirable outputs. We upper bound the worst-case robustness using stronger white-box attacks, indicating that most current deterministic defenses achieve nearly 0% worst-case robustness. We propose a general tight lower bound for randomized smoothing using fractional knapsack solvers or 0-1 knapsack solvers, and use them to bound the worst-case robustness of all stochastic defenses. Based on these solvers, we provide theoretical lower bounds for several previous empirical defenses. For example, we certify the robustness of a specific case, smoothing using a uniform kernel, against any possible attack with an average $\ell_0$ perturbation of 2.02 or an average suffix length of 6.41.

Paper Structure

This paper contains 80 sections, 9 theorems, 99 equations, 2 figures, 7 tables, 3 algorithms.

Key Result

Theorem 4.3

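(Proof in the appendix, appendix:proof:theorem:prove_knapsack, and in aho1974design.) The knapsack certification algorithm (algorithm:certify_knapsack) exactly solves the functional minimization part of the relaxed randomized smoothing problem (eq:randomized_smoothing_relax).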
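
The page does not reproduce the relaxed problem itself, so the following is only a minimal sketch of how a fractional knapsack solver can give such a lower bound for bounded $f$. It assumes the relaxation takes the common randomized smoothing form: minimize $\sum_i f_i q_i$ subject to $\sum_i f_i p_i = p_A$ and $0 \le f_i \le 1$, where $p_i$ and $q_i$ are the smoothing probabilities of outcome $i$ under the clean and the adversarial input. The function name, argument names, and this exact formulation are illustrative assumptions, not the paper's algorithm.

```python
from typing import Sequence


def fractional_knapsack_lower_bound(
    p: Sequence[float], q: Sequence[float], p_a: float
) -> float:
    """Minimise sum_i f_i * q_i subject to sum_i f_i * p_i = p_a, 0 <= f_i <= 1.

    Hypothetical helper: p[i] and q[i] are the probabilities of outcome i
    under the clean and adversarial smoothing distributions (an assumption
    about the form of the relaxed problem).
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    if not 0.0 <= p_a <= sum(p) + 1e-12:
        raise ValueError("p_a must lie between 0 and sum(p)")

    # Outcomes with p_i == 0 cannot help satisfy the constraint, so the
    # minimiser sets f_i = 0 there and they are skipped.
    items = [(qi / pi, pi) for pi, qi in zip(p, q) if pi > 0.0]
    # Greedy step of fractional knapsack: spend the clean-probability budget
    # on outcomes with the smallest adversarial cost per unit of budget.
    items.sort(key=lambda item: item[0])

    remaining = p_a
    bound = 0.0
    for ratio, pi in items:
        if remaining <= 0.0:
            break
        take = min(pi, remaining)  # the last item may be taken fractionally
        bound += take * ratio
        remaining -= take
    return bound
```

For example, with `p = [0.5, 0.3, 0.2]`, `q = [0.1, 0.3, 0.6]`, and `p_a = 0.6`, the greedy pass takes all of the first outcome and one third of the second, giving a bound of about 0.2. The binary-$f$ case does not allow the final fractional step, which is why it needs a 0-1 knapsack solver instead.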

Figures (2)

  • Figure 1: Comparison of the $p_{adv}$ versus $p_A$ curves for the absorbing kernel and the uniform kernel, illustrating the Knapsack algorithm. $p_{adv}$ is plotted on the vertical axis and $p_A$ on the horizontal axis. As the vocabulary size $|\mathcal{V}|$ increases, the curve for the uniform kernel gradually shifts downward and to the right, eventually matching that of the absorbing kernel.
  • Figure 2: Comparison between DiffPure (nie2022diffpure) and DiffTextPure using diffusion language models.
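
To make the two smoothing kernels compared in Figure 1 concrete, here is a minimal sketch of how each could perturb a token sequence. It assumes the usual discrete-diffusion definitions: the absorbing kernel replaces each token with a mask token with probability `beta`, and the uniform kernel replaces each token with a uniformly sampled vocabulary token with probability `beta`. The `MASK` string, the function names, and the `beta` parameter are illustrative assumptions rather than the paper's notation.

```python
import random
from typing import List, Sequence

MASK = "[MASK]"  # hypothetical mask token, chosen for illustration


def absorbing_kernel(tokens: Sequence[str], beta: float, rng: random.Random) -> List[str]:
    """Replace each token with MASK independently with probability beta."""
    return [MASK if rng.random() < beta else tok for tok in tokens]


def uniform_kernel(
    tokens: Sequence[str], beta: float, vocab: Sequence[str], rng: random.Random
) -> List[str]:
    """Replace each token with a uniformly drawn vocabulary token with probability beta.

    The draw may return the original token again; the chance of that
    shrinks as the vocabulary grows.
    """
    return [rng.choice(vocab) if rng.random() < beta else tok for tok in tokens]


if __name__ == "__main__":
    rng = random.Random(0)
    prompt = ["how", "do", "I", "bake", "bread"]
    vocab = ["how", "do", "I", "bake", "bread", "cake", "the", "a"]
    print(absorbing_kernel(prompt, 0.3, rng))
    print(uniform_kernel(prompt, 0.3, vocab, rng))
```

Under the uniform kernel a replaced token can coincide with the original, which is increasingly unlikely for a large vocabulary. This is one informal way to read the caption's statement that the uniform-kernel curve approaches the absorbing-kernel curve as $|\mathcal{V}|$ increases.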

Theorems & Definitions (21)

  • Definition 4.1
  • Definition 4.2
  • Theorem 4.3
  • Definition 4.4
  • Definition 5.1
  • Theorem 5.2
  • Definition 5.3
  • Theorem 5.4
  • Theorem 5.5
  • Remark C.1
  • ...and 11 more