On Discrete Prompt Optimization for Diffusion Models

Ruochen Wang, Ting Liu, Cho-Jui Hsieh, Boqing Gong

TL;DR

This paper tackles the problem of aligning text prompts with diffusion-model outputs by formulating prompt engineering as a discrete optimization over natural language. It introduces DPO-Diff, a gradient-based framework that uses compact, dynamically generated word subspaces and a novel Shortcut Text Gradient to backpropagate through diffusion inference with constant memory. The approach supports both prompt enhancement and adversarial prompt discovery, leveraging Gumbel-Softmax relaxation and Evolutionary Search to efficiently explore candidate prompts; negative prompts via Antonym Space notably outperform positive synonym-based prompts in practice. Empirical results across DiffusionDB, COCO, and ChatGPT-derived prompts show that DPO-Diff can outperform human-engineered prompts and prior baselines in faithfulness and attack efficacy, with supportive human evaluations. The work highlights a complementary paradigm to learning-based methods, offering a train-free, scalable avenue for prompt optimization in text-to-image diffusion systems, with implications for debugging, safety, and content quality.
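
To make the Gumbel-Softmax relaxation mentioned above concrete, here is a minimal, self-contained sketch of relaxing the choice of one word from a small candidate set. It is an illustration only, not the authors' implementation: the candidate words, the toy 3-dimensional embeddings, the logits, and the temperature `tau` are all made-up assumptions, and the helper names `gumbel_softmax` and `mix_embeddings` are hypothetical.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw one relaxed one-hot sample over len(logits) candidates."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        # U is clamped away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def mix_embeddings(weights, embeddings):
    """Weighted average of candidate embeddings: a soft stand-in for a table lookup."""
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)]


# Hypothetical candidates for one prompt position, with toy one-hot embeddings.
candidates = ["blurry", "dark", "cropped"]
embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
logits = [0.5, 2.0, -1.0]  # one learnable score per candidate (assumed values)

rng = random.Random(0)
weights = gumbel_softmax(logits, tau=0.5, rng=rng)
soft_embedding = mix_embeddings(weights, embeddings)

# A discrete choice can be read off by taking the highest-weight candidate.
chosen = candidates[max(range(len(weights)), key=weights.__getitem__)]
print(weights, chosen)
```

The weights sum to one and vary smoothly with the logits, so in an autodiff framework a gradient could flow from the image loss back to the logits. Lowering `tau` pushes the sample toward a one-hot vector.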

Abstract

This paper introduces the first gradient-based framework for prompt optimization in text-to-image diffusion models. We formulate prompt engineering as a discrete optimization problem over the language space. Two major challenges arise in efficiently finding a solution to this problem: (1) Enormous Domain Space: Setting the domain to the entire language space poses significant difficulty to the optimization process. (2) Text Gradient: Efficiently computing the text gradient is challenging, as it requires backpropagating through the inference steps of the diffusion model and a non-differentiable embedding lookup table. Beyond the problem formulation, our main technical contributions lie in solving the above challenges. First, we design a family of dynamically generated compact subspaces comprised of only the most relevant words to user input, substantially restricting the domain space. Second, we introduce "Shortcut Text Gradient" -- an effective replacement for the text gradient that can be obtained with constant memory and runtime. Empirical evaluation on prompts collected from diverse sources (DiffusionDB, ChatGPT, COCO) suggests that our method can discover prompts that substantially improve (prompt enhancement) or destroy (adversarial attack) the faithfulness of images generated by the text-to-image diffusion model.

Paper Structure

This paper contains 71 sections, 1 theorem, 13 equations, 10 figures, 3 tables, 1 algorithm.

Key Result

Proposition 3.1

The original parameterization of DDPM at step $t-K$, $\bm\mu_\theta(\bm{x}_{t-K}, t-K) = \frac{1}{\sqrt{\alpha_{t-K}}}\left(\bm{x}_{t-K} - \frac{\beta_{t-K}}{\sqrt{1 - \bar\alpha_{t-K}}}\bm\epsilon_\theta(\bm{x}_{t-K}, t-K)\right)$, can be viewed as first computing an estimate of $\bm{x}_0$ from the current-step error and then using that estimate to compute the transition probability $q(\bm{x}_{t-K-1}|\bm{x}_{t-K}, \bm{x}_0)$.
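
The identity behind this proposition can be checked numerically. The sketch below writes $s$ for $t-K$ and compares the epsilon-parameterized mean with the posterior mean of $q(x_{s-1}|x_s, \hat{x}_0)$. The $\hat{x}_0$ estimate and the posterior-mean coefficients are taken from the standard DDPM formulation, not from this page. The linear beta schedule, the scalar state, and the function names are illustrative assumptions.

```python
import math
import random


def ddpm_mean(x, eps, alpha, alpha_bar):
    """Epsilon-parameterized DDPM mean at one step (scalar state for simplicity)."""
    beta = 1.0 - alpha
    return (x - beta / math.sqrt(1.0 - alpha_bar) * eps) / math.sqrt(alpha)


def x0_estimate(x, eps, alpha_bar):
    """Estimate of x_0 obtained by inverting x = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps."""
    return (x - math.sqrt(1.0 - alpha_bar) * eps) / math.sqrt(alpha_bar)


def posterior_mean(x, x0, alpha, alpha_bar, alpha_bar_prev):
    """Mean of the Gaussian posterior q(x_{s-1} | x_s, x_0) in standard DDPM."""
    beta = 1.0 - alpha
    coef_x0 = math.sqrt(alpha_bar_prev) * beta / (1.0 - alpha_bar)
    coef_x = math.sqrt(alpha) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar)
    return coef_x0 * x0 + coef_x * x


# Assumed toy schedule: 1000 linearly spaced betas between 1e-4 and 0.02.
betas = [1e-4 + (0.02 - 1e-4) * i / 999 for i in range(1000)]
alphas = [1.0 - b for b in betas]
alpha_bars = []
running = 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)

rng = random.Random(0)
max_gap = 0.0
for s in (1, 10, 500, 999):
    # Random stand-ins for the noisy sample x_s and the predicted noise.
    x, eps = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    direct = ddpm_mean(x, eps, alphas[s], alpha_bars[s])
    via_x0 = posterior_mean(
        x, x0_estimate(x, eps, alpha_bars[s]), alphas[s], alpha_bars[s], alpha_bars[s - 1]
    )
    max_gap = max(max_gap, abs(direct - via_x0))

print(max_gap)
```

The two routes agree up to floating-point error at every tested step, which is the sense in which the direct mean "first estimates $x_0$ and then takes a posterior step".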

Figures (10)

  • Figure 1: Computational procedure of Shortcut Text Gradient (bottom) vs. Full Gradient (top) on text.
  • Figure 2: Win rate of DPO-Diff versus Promptist on the prompt improvement task, measured by human evaluation. DPO-Diff surpasses or matches the performance of Promptist 79% of the time on SD-v1 and 88% of the time on SD-XL.
  • Figure 3: Example images generated by improved negative prompts from DPO-Diff vs. Promptist (more in \ref{fig:improve_more}). Compared with Promptist, DPO-Diff generates images that better capture the content of the original prompt.
  • Figure 4: Example images generated by adversarial prompts from DPO-Diff (more in \ref{fig:attack_more}). While keeping the overall meaning similar to the user input, adversarial prompts completely destroy the prompt-following ability of the Stable Diffusion model.
  • Figure 5: Evolution of the optimized images from DPO-Diff at iterations 0, 10, 20, 40, and 80 (left to right). Noticeable improvements can be observed as early as 10 iterations, and the progression is surprisingly interpretable.
  • ...and 5 more figures

Theorems & Definitions (3)

  • Definition 4.1
  • Proposition 3.1
  • proof