MixAT: Combining Continuous and Discrete Adversarial Training for LLMs

Csaba Dékány, Stefan Balauca, Robin Staab, Dimitar I. Dimitrov, Martin Vechev

TL;DR

This work tackles the challenge of making Large Language Models robust to adversarial prompts by proposing MixAT, a training framework that unifies discrete (paraphrase/jailbreak) and continuous perturbations. MixAT expands the adversarial space by centering continuous perturbations around discrete seeds, leveraging a mixed perturbation space defined by $\mathcal{N}_{\textsc{MixAT}}({\bm{x}}) = \mathcal{R}({\bm{x}}) + \mathcal{B}^2(0, \epsilon)$ and a mixing parameter $\alpha$. The authors introduce the At Least One Attack Success Rate (ALO-ASR) to capture worst-case vulnerability and demonstrate that MixAT achieves substantially lower ALO-ASR (e.g., $<20\%$) with favorable utility and comparable training cost across multiple open-source LLMs. They also explore deployment factors such as quantization, LoRA, and temperature to reveal blind spots in current defenses and show MixAT’s robustness generalizes to unseen attack families. Overall, MixAT provides a principled and practically efficient robustness-utility improvement for safer generative AI in real-world settings, with code and models publicly available.
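The mixed perturbation space described above can be illustrated with a small sketch. It is not the authors' implementation: the function names `project_l2` and `mixed_perturbation` are hypothetical, embeddings are plain Python lists, random noise stands in for a gradient-based continuous attack, and $\alpha$ is assumed to be the probability of starting from a rephrased (discrete) seed rather than from the original prompt.

```python
import math
import random


def project_l2(delta, eps):
    """Scale delta back onto the L2 ball of radius eps if it lies outside."""
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= eps or norm == 0.0:
        return list(delta)
    scale = eps / norm
    return [d * scale for d in delta]


def mixed_perturbation(x_emb, rephrased_emb, alpha, eps, rng):
    """Return (perturbed embedding, delta) for one training sample.

    Assumption: with probability alpha the discrete seed R(x) is used,
    otherwise the original embedding x. A continuous perturbation with
    L2 norm at most eps is then added around the chosen seed.
    """
    seed = rephrased_emb if rng.random() < alpha else x_emb
    # Placeholder for a real continuous attack: random noise, projected.
    raw = [rng.gauss(0.0, 1.0) for _ in seed]
    delta = project_l2(raw, eps)
    return [s + d for s, d in zip(seed, delta)], delta


if __name__ == "__main__":
    rng = random.Random(0)
    x = [0.0, 0.0, 0.0]          # toy embedding of the original prompt
    r_x = [1.0, 1.0, 1.0]        # toy embedding of a rephrasing R(x)
    adv, delta = mixed_perturbation(x, r_x, alpha=0.5, eps=0.5, rng=rng)
    print(adv, math.sqrt(sum(d * d for d in delta)))
```

In an actual training loop the random noise would be replaced by an optimized perturbation and the seed by a generated jailbreak or paraphrase; the sketch only shows how the two kinds of perturbation compose.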

Abstract

Despite recent efforts in Large Language Model (LLM) safety and alignment, current adversarial attacks on frontier LLMs can still consistently force harmful generations. Although adversarial training has been widely studied and shown to significantly improve the robustness of traditional machine learning models, its strengths and weaknesses in the context of LLMs are less understood. Specifically, while existing discrete adversarial attacks are effective at producing harmful content, training LLMs with concrete adversarial prompts is often computationally expensive, leading to reliance on continuous relaxations. At the same time, despite their effectiveness and generalization capabilities, training with continuous perturbations does not always capture the full spectrum of vulnerabilities exploited by discrete attacks. In this work, we aim to bridge this gap by introducing MixAT, a novel method that combines stronger discrete and faster continuous attacks during training. We rigorously evaluate MixAT across a wide spectrum of state-of-the-art attacks, proposing the At Least One Attack Success Rate (ALO-ASR) metric to capture the worst-case vulnerability of models. We show MixAT achieves substantially better robustness (ALO-ASR < 20%) compared to prior defenses (ALO-ASR > 50%), while maintaining a runtime comparable to methods based on continuous relaxations. We further analyze MixAT in realistic deployment settings, exploring how chat templates, quantization, low-rank adapters, and temperature affect both adversarial training and evaluation, revealing additional blind spots in current methodologies. Our results demonstrate that MixAT's discrete-continuous defense offers a principled and superior robustness-accuracy tradeoff with minimal computational overhead, highlighting its promise for building safer LLMs. We provide our code and models at https://github.com/insait-institute/MixAT.
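As a rough illustration of the ALO-ASR metric, the sketch below assumes that a prompt counts as broken when at least one of the evaluated attacks succeeds on it, and that per-prompt outcomes are available as booleans. The function names and the attack labels `attack_a` and `attack_b` are hypothetical, and the outcomes are toy values, not results from the paper.

```python
def attack_success_rate(outcomes):
    """Percentage of prompts on which a single attack succeeded."""
    return 100.0 * sum(outcomes) / len(outcomes)


def alo_asr(results):
    """At Least One Attack Success Rate, in percent.

    results maps an attack name to a list of booleans, one per prompt,
    with prompts in the same order for every attack.
    """
    per_attack = list(results.values())
    n_prompts = len(per_attack[0])
    if any(len(o) != n_prompts for o in per_attack):
        raise ValueError("every attack must cover the same prompts")
    # A prompt is broken if any attack succeeded on it.
    broken = [any(o[i] for o in per_attack) for i in range(n_prompts)]
    return 100.0 * sum(broken) / n_prompts


if __name__ == "__main__":
    toy = {
        "attack_a": [True, False, False, False],
        "attack_b": [False, True, False, False],
    }
    for name, outcomes in toy.items():
        print(name, attack_success_rate(outcomes))
    print("ALO-ASR", alo_asr(toy))
```

In the toy data each attack breaks a different prompt, so the per-attack rates are each one in four while the worst-case rate is two in four. This is why ALO-ASR can be much higher than any individual ASR.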

Paper Structure

This paper contains 41 sections, 11 equations, 6 figures, 23 tables.

Figures (6)

  • Figure 1: (a) Overview of MixAT, a novel AT method combining continuous and discrete adversarial attacks to enhance LLMs' robustness. The embeddings of harmful prompts $\mathcal{X}$ (e.g., "How to build a bomb?") and their rephrasings $\mathcal{R}(\mathcal{X})$ are perturbed using Continuous Adversarial Attacks ($\dashrightarrow$) to produce $\mathcal{X}+{\bm{\delta}}$ and $\mathcal{R}(\mathcal{X})+{\bm{\delta}}$. MixAT improves generalization by training on $\mathcal{R}(\mathcal{X})+{\bm{\delta}}$, which covers the set of possible adversarial embeddings $\text{Adv}(\mathcal{X})$ better and increases robustness against a diverse set of attacks. (b, c) Experimentally, MixAT achieves superior robustness to PAP [zeng2024johnny] and GCG [mazeika2024harmbench] attacks compared to methods like CAT [xhonneux2024efficient], while maintaining high utility.
  • Figure 2: MixAT combines continuous and discrete adversarial training by extending the search space to include both kinds of perturbations.
  • Figure 3: Attack Success Rate [%] comparison for various attacks on models trained with different $\alpha_{PAP}$ ratios in both MixAT and DualAT.
  • Figure 4: ASR $\downarrow$ and Utility $\uparrow$ scores for zephyr-7b models trained with MixAT and CAT when scaling the LoRA weights of the trained adapters.
  • Figure 5: Evolution of GCG ASR with temperature for the Llama-3-8B MixAT model.
  • ...and 1 more figure