Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing
Jiabao Ji, Bairu Hou, Alexander Robey, George J. Pappas, Hamed Hassani, Yang Zhang, Eric Wong, Shiyu Chang
TL;DR
This work addresses the vulnerability of aligned LLMs to semantic jailbreak attacks by proposing SemanticSmooth, a smoothing-based defense that aggregates predictions over semantically transformed copies of each input. It introduces semantics-preserving transformations and an adaptive transformation policy that maintain nominal performance while improving robustness against attacks such as GCG, PAIR, and AutoDAN. The approach achieves state-of-the-art robustness across multiple LLMs, with more favorable trade-offs than baseline defenses on the InstructionFollowing and AlpacaEval benchmarks, and it provides a first quantitative interpretability analysis of the GCG attack. The authors also discuss practical considerations, including computational cost and dependence on the target model, and use semantic transformations to offer insight into the underlying attack strategies.
Abstract
Aligned large language models (LLMs) are vulnerable to jailbreaking attacks, which bypass the safeguards of targeted LLMs and fool them into generating objectionable content. While initial defenses show promise against token-based threat models, no existing defense provides robustness against semantic attacks while avoiding unfavorable trade-offs between robustness and nominal performance. To meet this need, we propose SEMANTICSMOOTH, a smoothing-based defense that aggregates the predictions of multiple semantically transformed copies of a given input prompt. Experimental results demonstrate that SEMANTICSMOOTH achieves state-of-the-art robustness against GCG, PAIR, and AutoDAN attacks while maintaining strong nominal performance on instruction-following benchmarks such as InstructionFollowing and AlpacaEval. The code will be publicly available at https://github.com/UCSB-NLP-Chang/SemanticSmooth.
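The aggregation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `semantic_smooth`, its parameters, uniform sampling of transformations, and the majority-vote rule over a refusal judge are all assumptions made for illustration, since the abstract does not specify how predictions are aggregated. The `model`, `transforms`, and `is_refusal` arguments are hypothetical callables supplied by the caller.

```python
import random
from typing import Callable, List, Sequence


def semantic_smooth(
    prompt: str,
    transforms: Sequence[Callable[[str], str]],
    model: Callable[[str], str],
    is_refusal: Callable[[str], bool],
    num_copies: int = 5,
    seed: int = 0,
) -> str:
    """Illustrative smoothing loop (assumed design, not the paper's code).

    Queries `model` on `num_copies` transformed copies of `prompt`,
    takes a majority vote on whether the responses are refusals, and
    returns one response that agrees with the majority.
    """
    if num_copies < 1:
        raise ValueError("num_copies must be at least 1")
    rng = random.Random(seed)

    # Build one response per transformed copy of the prompt.
    responses: List[str] = []
    for _ in range(num_copies):
        transform = rng.choice(transforms)  # assumption: uniform sampling
        responses.append(model(transform(prompt)))

    # Majority vote on refusal; an exact tie counts as "not a refusal".
    votes = [is_refusal(r) for r in responses]
    majority_refuses = sum(votes) * 2 > len(votes)

    # Return a response consistent with the majority outcome.
    agreeing = [r for r, v in zip(responses, votes) if v == majority_refuses]
    return rng.choice(agreeing)
```

In practice the transformations would be semantics-preserving rewrites such as paraphrasing or summarization, and the adaptive policy mentioned in the TL;DR would replace the uniform `rng.choice` over `transforms`; both are left as caller-supplied placeholders here.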
