Diffusion LLMs are Natural Adversaries for any LLM

David Lüdke; Tom Wollschläger; Paul Ungermann; Stephan Günnemann; Leo Schwinn

TL;DR

The paper introduces Inpainting, a diffusion-based framework that models the joint prompt-response distribution $q(\mathbf{x},\mathbf{y})$ and thereby converts costly adversarial prompt search into efficient amortized inference: prompts are generated by conditioning on a target response $\mathbf{y}^{\star}$. It formalizes fidelity-based guarantees showing that a small number of conditional samples from a surrogate $p_{\theta}(\mathbf{x}\mid\mathbf{y}^{\star})$ suffices to recover high-reward prompts, enabling transferable attacks across black-box LLMs. Empirically, Diffusion LLMs (DLLMs) generate low-perplexity, diverse jailbreak prompts that transfer to robustly trained and proprietary models, and feedback from the target model can guide generation further to raise attack success. The approach suggests broad utility for red teaming, automated prompt optimization, and the use of Flow- and Diffusion-based LLMs in defense research and adversarial testing.

Abstract

We introduce a novel framework that transforms the resource-intensive (adversarial) prompt optimization problem into an efficient, amortized inference task. Our core insight is that pretrained, non-autoregressive generative LLMs, such as Diffusion LLMs, which model the joint distribution over prompt-response pairs, can serve as powerful surrogates for prompt search. This approach enables the direct conditional generation of prompts, effectively replacing costly, per-instance discrete optimization with a small number of parallelizable samples. We provide a probabilistic analysis demonstrating that under mild fidelity assumptions, only a few conditional samples are required to recover high-reward (harmful) prompts. Empirically, we find that the generated prompts are low-perplexity, diverse jailbreaks that exhibit strong transferability to a wide range of black-box target models, including robustly trained and proprietary LLMs. Beyond adversarial prompting, our framework opens new directions for red teaming, automated prompt optimization, and leveraging emerging Flow- and Diffusion-based LLMs.
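
To make the "only a few conditional samples" claim concrete, the sketch below works through a simplified best-of-n calculation. It is an illustration under assumptions that are not taken from the paper: samples are treated as independent, and each one is assumed to be high-reward with a fixed probability `p`. The function names and the failure tolerance `delta` are hypothetical, and the paper's actual fidelity assumptions and bounds may take a different form.

```python
import math


def success_probability(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is high-reward.

    Assumption (illustrative, not from the paper): each sample succeeds
    independently with the same probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


def samples_needed(p: float, delta: float) -> int:
    """Smallest n such that 1 - (1 - p)**n >= 1 - delta."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    if p == 1.0:
        return 1  # a single sample always succeeds
    # Solve (1 - p)**n <= delta for n and round up to an integer.
    return math.ceil(math.log(delta) / math.log(1.0 - p))


if __name__ == "__main__":
    n = samples_needed(p=0.2, delta=0.05)
    print(n, success_probability(0.2, n))
```

Under these assumptions the required sample count grows only logarithmically in `1 / delta`, which is the intuition behind replacing per-instance optimization with a handful of parallel samples.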
