Diffusion LLMs are Natural Adversaries for any LLM

David Lüdke, Tom Wollschläger, Paul Ungermann, Stephan Günnemann, Leo Schwinn

TL;DR

The paper introduces Inpainting, a diffusion-based framework that models the joint prompt–response distribution $q({\mathbf{x}},{\mathbf{y}})$ and thereby converts costly adversarial prompt search into efficient amortized inference: prompts are sampled by conditioning on a target response ${\mathbf{y}}^{\star}$. It formalizes fidelity-based guarantees showing that a small number of conditional samples from a surrogate $p_{\theta}({\mathbf{x}}|{\mathbf{y}}^{\star})$ suffices to recover high-reward prompts, which enables transferable attacks across black-box LLMs. Empirically, Diffusion LLMs (DLLMs) generate low-perplexity, diverse jailbreak prompts that transfer to robust and proprietary models, and feedback from the target model can guide generation further to raise attack success. The approach suggests broad utility for red teaming, automated prompt optimization, and the use of Flow- and Diffusion-based LLMs in robust defense research and adversarial testing.

Abstract

We introduce a novel framework that transforms the resource-intensive (adversarial) prompt optimization problem into an efficient, amortized inference task. Our core insight is that pretrained, non-autoregressive generative LLMs, such as Diffusion LLMs, which model the joint distribution over prompt-response pairs, can serve as powerful surrogates for prompt search. This approach enables the direct conditional generation of prompts, effectively replacing costly, per-instance discrete optimization with a small number of parallelizable samples. We provide a probabilistic analysis demonstrating that under mild fidelity assumptions, only a few conditional samples are required to recover high-reward (harmful) prompts. Empirically, we find that the generated prompts are low-perplexity, diverse jailbreaks that exhibit strong transferability to a wide range of black-box target models, including robustly trained and proprietary LLMs. Beyond adversarial prompting, our framework opens new directions for red teaming, automated prompt optimization, and leveraging emerging Flow- and Diffusion-based LLMs.
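
The claim that only a few conditional samples are required can be illustrated with a generic best-of-n calculation. The sketch below is not the paper's theorem: it assumes each sample independently lands in the high-reward set with probability at least `rho`, and asks how many samples `n` are needed so that at least one succeeds with probability `1 - delta`. The names `rho`, `delta`, `success_probability`, and `samples_needed` are hypothetical and chosen only for this illustration.

```python
import math


def success_probability(rho: float, n: int) -> float:
    """Probability that at least one of n i.i.d. samples succeeds,
    when each succeeds with probability rho (an assumed quantity)."""
    return 1.0 - (1.0 - rho) ** n


def samples_needed(rho: float, delta: float) -> int:
    """Smallest n with 1 - (1 - rho)**n >= 1 - delta.

    Assumes independent samples; rho and delta are illustrative
    parameters, not notation taken from the paper.
    """
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must lie in (0, 1]")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    if rho == 1.0:
        return 1  # every sample succeeds
    # Solve (1 - rho)**n <= delta for n and round up.
    n = math.ceil(math.log(delta) / math.log(1.0 - rho))
    return max(n, 1)


if __name__ == "__main__":
    for rho in (0.05, 0.2, 0.5):
        n = samples_needed(rho, delta=0.05)
        print(f"rho={rho}: n={n}, success>={success_probability(rho, n):.3f}")
```

Because the failure probability `(1 - rho)**n` decays geometrically in `n`, the required sample count grows only logarithmically in `1 / delta`, which is why a modest per-sample success rate can already make a handful of parallel samples sufficient under these assumptions.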

Paper Structure

This paper contains 27 sections, 3 theorems, 18 equations, 9 figures, 2 tables.

Key Result

Lemma 1.1

Under the target fidelity and bounded reward assumptions, the difference in expected rewards is bounded by $\varepsilon_2$.

Figures (9)

  • Figure 1: We present Inpainting, a novel framework that reformulates the costly and iterative process of finding adversarial prompts into a simple inference task leveraging pretrained Diffusion LLMs.
  • Figure 2: Where the surrogate $p_{\theta}({\mathbf{x}} \mid {\mathbf{y}}^{\star})$ meets high expected reward under a black-box target model $P_{f}({\mathbf{y}} \mid {\mathbf{x}})$.
  • Figure 3: Efficiency comparison between state-of-the-art LLM attacks and the proposed Inpainting, which achieves near–Pareto-optimal performance in both attack success and generation cost for most models, particularly the robustly trained LAT and Circuit Breakers models.
  • Figure 4:
  • Figure 5: Likelihood guidance improves ASR.
  • ...and 4 more figures

Theorems & Definitions (6)

  • Lemma 1.1: Bounding the Expected Reward Difference
  • proof
  • Lemma 1.2: Set Inclusion
  • proof
  • Theorem 1.3: Probabilistic Lower Bound on Success
  • proof