Efficient Adversarial Training in LLMs with Continuous Attacks
Sophie Xhonneux, Alessandro Sordoni, Stephan Günnemann, Gauthier Gidel, Leo Schwinn
TL;DR
This work tackles the high computational cost of adversarial training for LLMs by computing attacks in the continuous embedding space instead of over discrete tokens, and introduces two algorithms: CAT (named C-AdvUL in the abstract) and CAPO (named C-AdvIPO in the abstract). CAT combines a robustness loss on an adversarial behaviour dataset with fine-tuning on utility data, while CAPO adapts IPO to adversarial alignment and needs no utility data. Experiments on five models between 2B and 7B parameters show that continuous adversarial training substantially improves robustness against discrete attacks such as GCG, AutoDAN, and PAIR while maintaining utility. The findings indicate that robustness to continuous perturbations can generalize to discrete jailbreaks, offering a scalable path to robustly aligning LLMs, though careful evaluation and dataset design remain crucial.
Abstract
Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails. In many domains, adversarial training has proven to be one of the most promising methods to reliably improve robustness against such attacks. Yet, in the context of LLMs, current methods for adversarial training are hindered by the high computational costs required to perform discrete adversarial attacks at each training iteration. We address this problem by instead calculating adversarial attacks in the continuous embedding space of the LLM, which is orders of magnitude more efficient. We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses: the first makes the model robust to continuous embedding attacks computed on an adversarial behaviour dataset; the second ensures the usefulness of the final model by fine-tuning on utility data. Moreover, we introduce C-AdvIPO, an adversarial variant of IPO that does not require utility data for adversarially robust alignment. Our empirical evaluation on five models from different families (Gemma, Phi3, Mistral, Zephyr, Llama2) and at different scales (2B, 3.8B, 7B) shows that both algorithms substantially enhance LLM robustness against discrete attacks (GCG, AutoDAN, PAIR), while maintaining utility. Our results demonstrate that robustness to continuous perturbations can extrapolate to discrete threat models. Thereby, we present a path toward scalable adversarial training algorithms for robustly aligning LLMs.
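To make the two-loss idea concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a logistic model over a single embedding vector in place of an LLM, an L-infinity perturbation budget, and a signed-gradient inner attack. All names (`embedding_attack`, `train_step`, `eps`, `alpha`, `utility_weight`) and all hyperparameter values are hypothetical. Label 1 stands for the safe behaviour on a harmful prompt, and the utility example carries its own label.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sign(x):
    return (x > 0) - (x < 0)

def nll(w, emb, label):
    """Negative log-likelihood of `label` (0 or 1) under a toy logistic model."""
    p = min(max(sigmoid(dot(w, emb)), 1e-12), 1.0 - 1e-12)
    return -math.log(p) if label == 1 else -math.log(1.0 - p)

def nll_grad_emb(w, emb, label):
    """Gradient of the loss with respect to the input embedding."""
    p = sigmoid(dot(w, emb))
    return [(p - label) * wi for wi in w]

def nll_grad_w(w, emb, label):
    """Gradient of the loss with respect to the model weights."""
    p = sigmoid(dot(w, emb))
    return [(p - label) * ei for ei in emb]

def embedding_attack(w, emb, label, eps=0.5, alpha=0.1, steps=10):
    """Continuous attack: signed-gradient ascent on a perturbation `delta`
    added to the embedding, clipped so that every |delta_i| <= eps."""
    delta = [0.0] * len(emb)
    for _ in range(steps):
        perturbed = [e + d for e, d in zip(emb, delta)]
        grad = nll_grad_emb(w, perturbed, label)
        delta = [min(eps, max(-eps, d + alpha * sign(g)))
                 for d, g in zip(delta, grad)]
    return delta

def train_step(w, harmful_emb, utility_emb, utility_label,
               lr=0.1, utility_weight=1.0):
    """One gradient step on (adversarial loss + utility_weight * utility loss)."""
    # Inner maximisation: find a perturbation that hurts the safe behaviour.
    delta = embedding_attack(w, harmful_emb, label=1)
    adv_emb = [e + d for e, d in zip(harmful_emb, delta)]
    # Outer minimisation: stay safe under attack, stay useful on clean data.
    g_adv = nll_grad_w(w, adv_emb, 1)
    g_util = nll_grad_w(w, utility_emb, utility_label)
    new_w = [wi - lr * (ga + utility_weight * gu)
             for wi, ga, gu in zip(w, g_adv, g_util)]
    return new_w, delta
```

The inner loop needs only gradients with respect to the embedding, so each attack step costs one forward and backward pass. A discrete attack would instead have to search over candidate token substitutions.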
