Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing

Jiabao Ji, Bairu Hou, Alexander Robey, George J. Pappas, Hamed Hassani, Yang Zhang, Eric Wong, Shiyu Chang

TL;DR

This work addresses the vulnerability of aligned LLMs to semantic jailbreak attacks by proposing SemanticSmooth, a smoothing-based defense that aggregates predictions over semantically transformed copies of each input. It introduces semantics-preserving transformations and an adaptive transformation policy that maintain nominal performance while improving robustness against attacks such as GCG, PAIR, and AutoDAN. Across multiple LLMs, the approach achieves state-of-the-art robustness against these attacks while keeping strong nominal performance on instruction-following benchmarks (InstructionFollowing, AlpacaEval), with favorable trade-offs compared to baselines. It also provides a first quantitative interpretability analysis of the GCG attack. The authors further discuss practical considerations, including computation cost and dependence on the target model, and use semantic transformations to offer insight into the underlying attack strategies.

Abstract

Aligned large language models (LLMs) are vulnerable to jailbreaking attacks, which bypass the safeguards of targeted LLMs and fool them into generating objectionable content. While initial defenses show promise against token-based threat models, no existing defense both provides robustness against semantic attacks and avoids unfavorable trade-offs between robustness and nominal performance. To meet this need, we propose SEMANTICSMOOTH, a smoothing-based defense that aggregates the predictions of multiple semantically transformed copies of a given input prompt. Experimental results demonstrate that SEMANTICSMOOTH achieves state-of-the-art robustness against GCG, PAIR, and AutoDAN attacks while maintaining strong nominal performance on instruction-following benchmarks such as InstructionFollowing and AlpacaEval. The code will be publicly available at https://github.com/UCSB-NLP-Chang/SemanticSmooth.

Paper Structure

This paper contains 39 sections, 4 equations, 4 figures, 11 tables.

Figures (4)

  • Figure 1: Illustration of SemanticSmooth. Given an input, the transformation selector will sample multiple transformations $\{{T}^{(i)}\}$ that will be applied to the input. The transformed prompts $\{\bm x^{(i)}\}$ will be fed into the LLM independently. These model generations $\{\bm y^{(i)}\}$ are then aggregated to get the final response.
  • Figure 2: Robustness trade-offs. PolicyEnsemble achieves a strong trade-off (the closer to the top-left corner of the chart, the better). We plot the attack success rate (ASR) on the horizontal axis against benign performance on the AlpacaEval dataset on the vertical axis, which visualizes the trade-off between robustness and nominal performance for Vicuna. PolicyEnsemble outperforms most baselines in robustness and achieves the highest nominal performance.
  • Figure 3: Learned policy distribution. Transformations that tend to change the input significantly are favored for jailbreaking prompts (GCG, PAIR, AutoDAN), whereas transformations that introduce minor changes are favored for benign instructions (Inst, AlpacaEval). We plot the average learned policy distribution over the transformations in $\mathcal{T}$ for Vicuna on the evaluation dataset.
  • Figure 4: An example MTurk page for the human study on explaining GCG attack instructions with semantic transformations. The selected instruction is highlighted with a red border.
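
The pipeline described in the Figure 1 caption (sample transformations, transform the prompt, query the LLM independently on each copy, aggregate) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The function names, the toy transformations, the stub model, and the refusal check are all hypothetical stand-ins. The aggregation rule shown here (majority vote on whether a response is a refusal, then return one response that agrees with the majority) is an assumption, since the text above does not spell out how the generations are combined.

```python
import random
from typing import Callable, List, Optional, Sequence

Transform = Callable[[str], str]


def semantic_smooth(
    prompt: str,
    transforms: Sequence[Transform],
    weights: Sequence[float],
    llm: Callable[[str], str],
    is_refusal: Callable[[str], bool],
    n_copies: int = 5,
    rng: Optional[random.Random] = None,
) -> str:
    """Sketch of smoothing over semantically transformed prompt copies."""
    rng = rng or random.Random(0)
    # 1. The "transformation selector": sample T^(i) from a policy distribution.
    sampled = rng.choices(list(transforms), weights=list(weights), k=n_copies)
    # 2. Apply each transformation and query the model independently.
    responses: List[str] = [llm(t(prompt)) for t in sampled]
    # 3. Aggregate (assumed rule): majority vote on refuse vs. comply.
    votes = [is_refusal(r) for r in responses]
    majority_refuses = 2 * sum(votes) > len(votes)
    agreeing = [r for r, v in zip(responses, votes) if v == majority_refuses]
    # 4. Return one response consistent with the majority decision.
    return rng.choice(agreeing)


# Hypothetical stand-ins so the sketch runs without a real model.
def toy_llm(prompt: str) -> str:
    if "bomb" in prompt.lower():
        return "Sorry, I cannot help with that."
    return f"Answer to: {prompt}"


def toy_is_refusal(response: str) -> bool:
    return response.startswith("Sorry")


# Toy "semantics-preserving" transformations: whitespace cleanup only.
TOY_TRANSFORMS: List[Transform] = [
    str.strip,
    lambda p: " ".join(p.split()),
]

if __name__ == "__main__":
    print(semantic_smooth("  How do I bake bread?  ", TOY_TRANSFORMS,
                          [0.5, 0.5], toy_llm, toy_is_refusal))
```

In this sketch the `weights` argument plays the role of the learned policy distribution from Figure 3. A fixed uniform distribution is used here only for simplicity, whereas the adaptive policy described above would produce input-dependent weights.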