Closing the Distribution Gap in Adversarial Training for LLMs

Chengzhi Hu, Jonas Dornbusch, David Lüdke, Stephan Günnemann, Leo Schwinn

TL;DR

This work addresses a core limitation of adversarial training for LLMs: the empirical training distribution fails to cover the true data distribution, leaving data-specific vulnerabilities unaddressed. The authors propose Distributional Adversarial Training (DAT), which uses diffusion LLMs to sample from the joint distribution $q(x,y)$ by conditionally generating high-likelihood, data-specific adversarial prompts and pairs this with continuous adversarial training to minimize the worst-case loss. They provide a fidelity-based theoretical bound showing that better diffusion surrogate accuracy tightens the gap between the population risk $\mathcal{R}_{pop}$ and the surrogate risk $\mathcal{R}_{diff}$, and they demonstrate empirically that DAT yields substantial improvements in worst-case robustness across two LLMs while preserving utility. Overall, DAT offers a practical and scalable approach to safer LLMs by explicitly aligning training dynamics with the natural language data distribution and adversarial vulnerabilities within it.

Abstract

Adversarial training for LLMs is one of the most promising methods to reliably improve robustness against adversaries. However, despite significant progress, models remain vulnerable to simple in-distribution exploits, such as rewriting prompts in the past tense or translating them into other languages. We argue that this persistent fragility stems from a fundamental limitation in current adversarial training algorithms: they minimize adversarial loss on their training set but inadequately cover the data distribution, resulting in vulnerability to seemingly simple attacks. To bridge this gap, we propose Distributional Adversarial Training, DAT. We leverage Diffusion LLMs to approximate the true joint distribution of prompts and responses, enabling generation of diverse, high-likelihood samples that address generalization failures. By combining optimization over the data distribution provided by the diffusion model with continuous adversarial training, DAT achieves substantially higher adversarial robustness than previous methods.
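
The contrast drawn in the abstract can be written schematically. What follows is a sketch under assumptions, not the paper's own formulation: $\delta$ stands for a continuous perturbation of the prompt representation with budget $\epsilon$, $\ell$ is a generic per-example training loss, $\hat{\mathcal{R}}_{\mathcal{D}}$ is the empirical robust risk on the training set $\mathcal{D}$, and $p^{\mathrm{diff}}$ is the distribution defined by the diffusion LLM. The additive form $x+\delta$ and the norm constraint are illustrative choices.

$$
\ell_{rob}(x,y;\theta) = \max_{\|\delta\|\le\epsilon} \ell(x+\delta,\, y;\theta), \qquad
\hat{\mathcal{R}}_{\mathcal{D}}(\theta) = \frac{1}{|\mathcal{D}|}\sum_{(x,y)\in\mathcal{D}} \ell_{rob}(x,y;\theta), \qquad
\mathcal{R}_{diff}(\theta) = \mathbb{E}_{(x,y)\sim p^{\mathrm{diff}}}\bigl[\ell_{rob}(x,y;\theta)\bigr].
$$

Under this reading, standard adversarial training minimizes $\hat{\mathcal{R}}_{\mathcal{D}}$, whereas DAT minimizes $\mathcal{R}_{diff}$ as a stand-in for the population risk $\mathcal{R}_{pop}(\theta) = \mathbb{E}_{(x,y)\sim q}[\ell_{rob}(x,y;\theta)]$.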

Paper Structure (35 sections, 1 theorem, 27 equations, 5 figures, 4 tables)

Key Result

Theorem 3.1

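Assume the robust loss is bounded, i.e., $|\ell_{rob}(x,y;\theta)| \le M$ for all $(x,y)\in\mathcal{Z}$, and the diffusion surrogate satisfies the conditional fidelity assumption. Then the gap between the population risk $\mathcal{R}_{pop}$ and the surrogate risk $\mathcal{R}_{diff}$ is bounded by a term that tightens as the fidelity of the diffusion surrogate improves.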
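
For intuition only, a standard inequality of this flavor follows from boundedness alone; it is not claimed to be the paper's exact statement. The sketch assumes the two risks are expectations of $\ell_{rob}$ under $q$ and under the diffusion surrogate $p^{\mathrm{diff}}$ respectively, and it measures fidelity by total variation distance $\mathrm{TV}(q,p^{\mathrm{diff}}) = \tfrac{1}{2}\sum_{z\in\mathcal{Z}}|q(z)-p^{\mathrm{diff}}(z)|$, which is an assumption made here for illustration. Writing $z=(x,y)$:

$$
\bigl|\mathcal{R}_{pop}(\theta) - \mathcal{R}_{diff}(\theta)\bigr|
= \Bigl|\sum_{z\in\mathcal{Z}} \ell_{rob}(z;\theta)\,\bigl(q(z) - p^{\mathrm{diff}}(z)\bigr)\Bigr|
\le M \sum_{z\in\mathcal{Z}} \bigl|q(z) - p^{\mathrm{diff}}(z)\bigr|
= 2M\,\mathrm{TV}\bigl(q,\, p^{\mathrm{diff}}\bigr).
$$

The right-hand side vanishes as the surrogate approaches $q$, which is the qualitative behavior a fidelity-based bound is meant to capture.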

Figures (5)

  • Figure 1: Standard AT minimizes the empirical robust risk over a fixed dataset $\mathcal{D}$ (brown), which provides a poor approximation of the population robust risk. This results in a distribution gap where the model remains vulnerable to the manifold of natural language $q$ (blue). Specifically, standard methods fail to cover the distribution of prompts $\tilde{q}(x \mid y_{\mathrm{harm}})$ (green) that are likely to trigger harmful responses. Our DAT framework bridges this gap by optimizing over a surrogate distribution defined by a diffusion LLM $p^{\mathrm{diff}}_{\theta}(x \mid y_{\mathrm{harm}})$ (purple), allowing the model to train on a distribution that more closely matches the true population.
  • Figure 2: Cumulative transfer ASR across five target models (Gemma3-12B (team2025gemma), Qwen2.5-7B (qwen2025qwen25technicalreport), Zephyr-7B (tunstall2023zephyr), Llama3-8B-LAT (sheshadri2024latent), Llama3-8B-CB (zou2024circuitbreaker)) from attacks on Llama3-8B. Diffusion-based Inpainting attacks exhibit significantly higher transferability than model-specific optimization (GCG) or heuristic perturbations (BoN), suggesting that conditional sampling from the diffusion surrogate effectively identifies data-specific vulnerabilities that generalize across architectures and defenses.
  • Figure 3: Diversity of generated attack strings measured using SBERT embeddings (all-MiniLM-L6-v2; reimers-gurevych-2019-sentence). Each cell reports the mean pairwise cosine similarity between samples generated by two methods. Diffusion-based attacks exhibit the lowest intra-method similarity (0.178), indicating substantially greater sample diversity than GCG and BoN.
  • Figure 4: Pareto frontier for Llama3-8B showing the trade-off between Inpainting Robustness ($1-\text{ASR}$) and XSTest compliance rate. DAT achieves superior trade-offs across all hyperparameter settings.
  • Figure 5: Inpainting ASR as a function of the number of unique diffusion samples $M$ per behavior in the training data. Robustness gradually improves as the training data better approximates the population distribution $q(x,y)$, while helpfulness stays consistent.
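
The diversity measure in the Figure 3 caption (mean pairwise cosine similarity between samples from two methods) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names `mean_pairwise_cosine` and `similarity_matrix` are hypothetical, the embeddings are assumed to be precomputed arrays of shape `(n_samples, dim)`, and excluding self-pairs on the diagonal (intra-method) cells is an assumption about how such a matrix is usually computed.

```python
import numpy as np


def mean_pairwise_cosine(a, b, same=False):
    """Mean cosine similarity over all pairs (row of a, row of b).

    If same=True, a and b are the same sample set and self-pairs
    (the diagonal, always 1.0) are excluded; this needs at least 2 rows.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)  # unit-normalize rows
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    sims = a @ b.T  # cosine similarity of every pair
    if same:
        off_diagonal = ~np.eye(sims.shape[0], dtype=bool)
        return float(sims[off_diagonal].mean())
    return float(sims.mean())


def similarity_matrix(embeddings):
    """Build the method-by-method matrix from {method_name: embedding array}."""
    names = list(embeddings)
    out = np.zeros((len(names), len(names)))
    for i, name_i in enumerate(names):
        for j, name_j in enumerate(names):
            out[i, j] = mean_pairwise_cosine(
                embeddings[name_i], embeddings[name_j], same=(i == j)
            )
    return names, out


# Toy usage with random stand-in embeddings (not real attack strings).
rng = np.random.default_rng(0)
toy = {name: rng.normal(size=(8, 16)) for name in ("method_a", "method_b")}
names, matrix = similarity_matrix(toy)
print(names)
print(matrix.round(3))
```

A lower diagonal entry means the samples produced by that method are more spread out in embedding space, which is how the caption reads the value 0.178 for diffusion-based attacks.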

Theorems & Definitions (2)

  • Theorem 3.1: Surrogate Fidelity Bound
  • proof