Efficient Adversarial Training in LLMs with Continuous Attacks

Sophie Xhonneux, Alessandro Sordoni, Stephan Günnemann, Gauthier Gidel, Leo Schwinn

TL;DR

This work tackles the high computational cost of adversarial training for LLMs by shifting attacks to continuous embedding space and introducing two techniques, CAT and CAPO. CAT combines adversarial behavior data with utility fine-tuning, while CAPO adapts IPO to adversarial alignment without requiring utility data. Empirical results across multiple models and scales show that continuous adversarial training substantially improves robustness against discrete attacks such as GCG, AutoDAN, and PAIR while maintaining utility, with CAPO delivering strong performance without extra utility data. The findings demonstrate that robustness to continuous perturbations can generalize to discrete jailbreaks, offering a scalable path to robustly aligning LLMs, though careful evaluation and dataset design remain crucial.

Abstract

Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails. In many domains, adversarial training has proven to be one of the most promising methods to reliably improve robustness against such attacks. Yet, in the context of LLMs, current methods for adversarial training are hindered by the high computational costs required to perform discrete adversarial attacks at each training iteration. We address this problem by instead calculating adversarial attacks in the continuous embedding space of the LLM, which is orders of magnitude more efficient. We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses: the first makes the model robust on continuous embedding attacks computed on an adversarial behaviour dataset; the second ensures the usefulness of the final model by fine-tuning on utility data. Moreover, we introduce C-AdvIPO, an adversarial variant of IPO that does not require utility data for adversarially robust alignment. Our empirical evaluation on five models from different families (Gemma, Phi3, Mistral, Zephyr, Llama2) and at different scales (2B, 3.8B, 7B) shows that both algorithms substantially enhance LLM robustness against discrete attacks (GCG, AutoDAN, PAIR), while maintaining utility. Our results demonstrate that robustness to continuous perturbations can extrapolate to discrete threat models. Thereby, we present a path toward scalable adversarial training algorithms for robustly aligning LLMs.
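To make the idea of attacking in continuous embedding space concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy model (mean-pooled embeddings followed by a linear layer) in place of an LLM, an L-infinity ball of radius `eps`, and signed-gradient steps of size `alpha`. The norm, the step rule, and all function names are assumptions made for illustration; the abstract does not specify them.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def forward(emb, W):
    """Toy stand-in for an LLM: mean-pool token embeddings, then a linear layer."""
    dim = len(emb[0])
    pooled = [sum(tok[j] for tok in emb) / len(emb) for j in range(dim)]
    logits = [sum(row[j] * pooled[j] for j in range(dim)) for row in W]
    return softmax(logits)


def target_loss(emb, W, target):
    """Cross-entropy of the toy model towards the attacker's target class."""
    return -math.log(forward(emb, W)[target])


def grad_wrt_embeddings(emb, W, target):
    """Analytic gradient of target_loss with respect to every token embedding."""
    probs = forward(emb, W)
    dim = len(emb[0])
    # d loss / d logit_k = p_k - 1[k == target]
    dlogits = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    dpooled = [sum(dlogits[k] * W[k][j] for k in range(len(W))) for j in range(dim)]
    # Mean pooling spreads the gradient equally over the tokens.
    return [[g / len(emb) for g in dpooled] for _ in emb]


def sign(x):
    return (x > 0) - (x < 0)


def perturb(emb, delta):
    """Add the perturbation to the clean embeddings."""
    return [[x + d for x, d in zip(tok, dtok)] for tok, dtok in zip(emb, delta)]


def continuous_attack(emb, W, target, eps=0.05, alpha=0.01, steps=10):
    """Signed-gradient descent on the target loss inside an L-inf eps-ball (assumed norm)."""
    delta = [[0.0] * len(tok) for tok in emb]
    for _ in range(steps):
        grad = grad_wrt_embeddings(perturb(emb, delta), W, target)
        for i, gtok in enumerate(grad):
            for j, g in enumerate(gtok):
                # Step towards lower target loss, then project back into the ball.
                delta[i][j] = max(-eps, min(eps, delta[i][j] - alpha * sign(g)))
    return delta
```

Each attack step costs one gradient evaluation and involves no search over discrete tokens, which is where the efficiency gain described above comes from. In a training loop along the lines the abstract describes, the perturbed embeddings would feed the robustness loss while a separate loss on clean utility data preserves helpfulness; that loop is not shown here.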

Paper Structure (65 sections, 21 equations, 5 figures, 11 tables)

Figures (5)

  • Figure 1: We propose continuous adversarial training (AT) to address the large computational requirements of existing discrete AT approaches (Mazeika et al., 2024). We demonstrate that robustness against continuous attacks successfully extrapolates to discrete threats, such as suffix and jailbreaking attacks, while being considerably faster to compute.
  • Figure 2: Trade-off between utility and robustness for CAT (Eq. \ref{eq:ul+utility}), CAPO (Eq. \ref{eq:adv dpo}), and R2D2 (Mazeika et al., 2024), compared to their non-adversarially fine-tuned models. The objective is a small loss in utility and a large improvement in attack robustness. Larger is better for MMLU, Arc-E, Arc-C, and MT-Bench (left of dashed line). Smaller is better for GCG, AutoDAN, and PAIR (right of dashed line). The MT-Bench score is multiplied by 10 so that changes in performance are visible on this $y$-axis. Additional results are included in App. \ref{app:mainresults}.
  • Figure 3: Ablation of how changing $\beta$ or $\epsilon$ affects GCG loss vs. MMLU score on Gemma-IPO.
  • Figure 4: Gemma-IPO used for both plots: (a) Correlation between GCG loss and continuous attack loss. (b) GCG loss vs MMLU score for a variety of $\epsilon$ and $\beta$ values.
  • Figure 5: (a-b) Cross-entropy loss of an embedding attack performed in an $\epsilon$-ball around the instruction embeddings. The same $\epsilon$ as during training is used. For the base models, we use $\epsilon = 0.05$. (c) For unconstrained attacks, the loss converges to $0$ for all models, showing that gradient obfuscation is not an issue during attack optimization. The black dashed line indicates the threshold at which an affirmative response is achieved for all toxic queries.