Towards the Worst-case Robustness of Large Language Models

Huanran Chen; Yinpeng Dong; Zeming Wei; Hang Su; Jun Zhu

TL;DR

The paper studies the worst-case robustness of large language models. It first develops a strong adaptive white-box attack, I$^2$-GCG, which enforces token consistency between optimization and inference and exposes vulnerabilities in most deterministic defenses. It then introduces a theoretical framework that derives tight lower bounds for randomized smoothing, using a Fractional Knapsack solver when $f$ is bounded and a 0-1 Knapsack solver when $f$ is binary, which enables black-box certified robustness for stochastic defenses. In two case studies on Absorbing and Uniform kernels, the authors prove that Uniform kernels yield stronger certified robustness and that their bounds converge to the Absorbing bounds in the limit of large vocabulary, which informs kernel choice. Empirically, DiffTextPure defends well against optimization-based attacks in black-box settings, and the certification yields nontrivial average radii on AdvBench (e.g., $2.02$ for $\ell_0$ perturbations and $6.41$ for suffix attacks), with implications for robust deployment of LLMs and for the need for improved certification methods.

Abstract

Recent studies have revealed the vulnerability of large language models to adversarial attacks, where adversaries craft specific input sequences to induce harmful, violent, private, or incorrect outputs. In this work, we study their worst-case robustness, i.e., whether an adversarial example exists that leads to such undesirable outputs. We upper bound the worst-case robustness using stronger white-box attacks, indicating that most current deterministic defenses achieve nearly 0% worst-case robustness. We propose a general tight lower bound for randomized smoothing using fractional knapsack solvers or 0-1 knapsack solvers, and use them to bound the worst-case robustness of all stochastic defenses. Based on these solvers, we provide theoretical lower bounds for several previous empirical defenses. For example, we certify the robustness of a specific case, smoothing using a uniform kernel, against any possible attack with an average $\ell_0$ perturbation of 2.02 or an average suffix length of 6.41.
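The abstract names fractional knapsack solvers as the tool behind the bound. For readers unfamiliar with that problem, the following is a minimal Python sketch of the textbook greedy solver. It illustrates the generic problem only and is not the paper's certification procedure; the function name `fractional_knapsack`, the `(value, weight)` item format, and the example numbers are assumptions made for illustration.

```python
from typing import List, Tuple


def fractional_knapsack(items: List[Tuple[float, float]], capacity: float) -> float:
    """Return the maximum total value when items may be taken fractionally.

    items: hypothetical (value, weight) pairs, each with weight > 0.
    capacity: total weight budget.
    """
    # Greedy rule: take items in decreasing order of value per unit weight.
    ordered = sorted(items, key=lambda vw: vw[0] / vw[1], reverse=True)

    total = 0.0
    remaining = capacity
    for value, weight in ordered:
        if remaining <= 0:
            break
        # Take the whole item if it fits, otherwise only the fraction that fits.
        take = min(weight, remaining)
        total += value * (take / weight)
        remaining -= take
    return total


if __name__ == "__main__":
    # Illustrative numbers only; they do not come from the paper.
    example_items = [(60.0, 10.0), (100.0, 20.0), (120.0, 30.0)]
    print(fractional_knapsack(example_items, capacity=50.0))
```

Unlike the 0-1 variant, where each item is either taken or left and the greedy rule can fail, the fractional problem is solved exactly by this density ordering.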
