Closing the Distribution Gap in Adversarial Training for LLMs
Chengzhi Hu, Jonas Dornbusch, David Lüdke, Stephan Günnemann, Leo Schwinn
TL;DR
This work addresses a core limitation of adversarial training for LLMs: the empirical training distribution does not cover the true data distribution, so data-specific vulnerabilities go unaddressed. The authors propose Distributional Adversarial Training (DAT), which uses diffusion LLMs to sample from the joint distribution $q(x,y)$ by conditionally generating high-likelihood, data-specific adversarial prompts, and pairs this sampling with continuous adversarial training to minimize the worst-case loss. They provide a fidelity-based theoretical bound showing that a more accurate diffusion surrogate tightens the gap between the population risk $\mathcal{R}_{pop}$ and the surrogate risk $\mathcal{R}_{diff}$, and they demonstrate empirically that DAT substantially improves worst-case robustness across two LLMs while preserving utility. Overall, DAT offers a practical and scalable approach to safer LLMs by explicitly aligning training with the natural language data distribution and the adversarial vulnerabilities within it.
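The objective summarized above can be read as a min-max problem. The formulation below is only an illustrative sketch: apart from $q(x,y)$ and $\mathcal{R}_{diff}$, every symbol is an assumption introduced here and not taken from the paper. $\theta$ denotes the model parameters, $f_\theta$ the model, $e(x)$ the embeddings of prompt $x$, $\delta$ a continuous perturbation with budget $\epsilon$, and $\ell$ a per-example loss.

$$
\mathcal{R}_{diff}(\theta) \;=\; \mathbb{E}_{(x,y)\sim q}\Big[\max_{\|\delta\|\le\epsilon}\ \ell\big(f_\theta(e(x)+\delta),\,y\big)\Big],
\qquad
\theta^\star \in \arg\min_{\theta}\ \mathcal{R}_{diff}(\theta)
$$

Under this reading, the outer expectation is taken over samples drawn from the diffusion LLM instead of a fixed training set, and the inner maximization is the continuous adversarial attack. The exact loss, norm, and perturbation budget used by the authors may differ.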
Abstract
Adversarial training for LLMs is one of the most promising methods to reliably improve robustness against adversaries. However, despite significant progress, models remain vulnerable to simple in-distribution exploits, such as rewriting prompts in the past tense or translating them into other languages. We argue that this persistent fragility stems from a fundamental limitation in current adversarial training algorithms: they minimize adversarial loss on their training set but inadequately cover the data distribution, resulting in vulnerability to seemingly simple attacks. To bridge this gap, we propose Distributional Adversarial Training (DAT). We leverage diffusion LLMs to approximate the true joint distribution of prompts and responses, enabling the generation of diverse, high-likelihood samples that address generalization failures. By combining optimization over the data distribution provided by the diffusion model with continuous adversarial training, DAT achieves substantially higher adversarial robustness than previous methods.
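The combination described above (fresh samples from a generative surrogate, plus an adversarial perturbation in continuous space) can be illustrated with a toy training loop. This is a minimal sketch under strong assumptions, not the authors' implementation: a NumPy logistic-regression model stands in for the LLM, a hypothetical Gaussian sampler `sample_surrogate` stands in for the diffusion LLM, and the perturbation is constrained to an L2 ball of radius `eps`. All function names and hyperparameters are invented for illustration.

```python
import numpy as np

def sample_surrogate(rng, n, dim):
    """Hypothetical stand-in for the diffusion LLM: draws (embedding, label) pairs."""
    y = rng.integers(0, 2, size=n).astype(float)
    x = rng.normal(size=(n, dim)) + (2.0 * y[:, None] - 1.0)
    return x, y

def loss_and_grads(w, x, y):
    """Mean logistic loss plus its gradients w.r.t. the weights and the inputs."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    loss = -np.mean(y * np.log(p + 1e-12) + (1.0 - y) * np.log(1.0 - p + 1e-12))
    g = (p - y) / len(y)                      # d(loss)/d(logit), per example
    return loss, x.T @ g, g[:, None] * w[None, :]

def project_l2(delta, eps):
    """Project each row of delta onto the L2 ball of radius eps."""
    norms = np.linalg.norm(delta, axis=1, keepdims=True)
    return delta * np.minimum(1.0, eps / np.maximum(norms, 1e-12))

def inner_max(w, x, y, eps, steps, step_size):
    """Inner maximization: projected gradient ascent on a continuous perturbation."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        _, _, gx = loss_and_grads(w, x + delta, y)
        gnorm = np.maximum(np.linalg.norm(gx, axis=1, keepdims=True), 1e-12)
        delta = project_l2(delta + step_size * gx / gnorm, eps)
    return delta

def train(steps=200, batch=64, dim=8, eps=0.5, lr=0.5, seed=0):
    """Outer minimization: each step uses a fresh batch from the surrogate sampler."""
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    for _ in range(steps):
        x, y = sample_surrogate(rng, batch, dim)
        delta = inner_max(w, x, y, eps, steps=5, step_size=eps / 2)
        _, gw, _ = loss_and_grads(w, x + delta, y)
        w -= lr * gw                          # descend on the worst-case loss
    return w
```

The design choice this sketch highlights is that the outer loop never revisits a fixed training set: every step draws new samples from the surrogate, so the worst-case loss is minimized over the sampler's distribution. In the setting the abstract describes, the sampler would be a diffusion LLM and the perturbation would act on the model's continuous inputs; both are replaced here by simple placeholders.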
