FIND: Fine-tuning Initial Noise Distribution with Policy Optimization for Diffusion Models

Changgu Chen, Libing Yang, Xiaoyan Yang, Lianggangxu Chen, Gaoqi He, Changbo Wang, Yang Li

TL;DR

FIND addresses prompt-content misalignment in diffusion models by recasting the denoising procedure as a one-step MDP and optimizing the initial noise distribution directly with policy gradients. A dynamic reward calibration module stabilizes training, and a ratio clipping scheme reuses historical data while keeping the optimized distribution close to the original policy. In text-to-image and text-to-video experiments, FIND achieves better prompt consistency than SOTA methods and is reported to be 10 times faster than the SOTA approach. Because only the initial noise distribution is optimized, outputs can be aligned with prompts without retraining the underlying diffusion model.

Abstract

In recent years, large-scale pre-trained diffusion models have demonstrated outstanding capabilities in image and video generation tasks. However, existing models tend to produce visual objects commonly found in the training dataset, so the output diverges from the user's input prompt. The underlying reason for the inaccurate results is the model's difficulty in sampling from the specific intervals of the initial noise distribution that correspond to the prompt. Moreover, it is challenging to optimize the initial distribution directly, because the diffusion process involves multiple denoising steps. In this paper, we introduce a Fine-tuning Initial Noise Distribution (FIND) framework with policy optimization, which unleashes the potential of pre-trained diffusion networks by directly optimizing the initial distribution to align the generated content with user-input prompts. To this end, we first reformulate the diffusion denoising procedure as a one-step Markov decision process and employ policy optimization to directly optimize the initial distribution. In addition, a dynamic reward calibration module is proposed to ensure training stability during optimization. Furthermore, we introduce a ratio clipping algorithm that reuses historical data for network training and prevents the optimized distribution from deviating too far from the original policy, which restrains excessive optimization magnitudes. Extensive experiments demonstrate the effectiveness of our method in both text-to-image and text-to-video tasks, surpassing SOTA methods in consistency between prompts and generated content. Our method is 10 times faster than the SOTA approach. Our homepage is available at https://github.com/vpx-ecnu/FIND-website.
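To make the one-step formulation in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes the initial noise policy is an isotropic Gaussian with a learnable mean `mu` and a fixed `sigma`, and it replaces the whole "denoise, then score" pipeline with a hypothetical black-box function `toy_reward`. The update is a generic PPO-style clipped surrogate that reuses one sampled batch for several epochs. The batch-standardized reward is only a simple stand-in for stabilization and is not the paper's dynamic reward calibration module. All names and hyperparameters here are illustrative assumptions.

```python
import numpy as np


def log_prob(z, mu, sigma):
    """Per-sample log-density of an isotropic Gaussian N(mu, sigma^2 I)."""
    d = z.shape[-1]
    return (-0.5 * np.sum(((z - mu) / sigma) ** 2, axis=-1)
            - d * np.log(sigma) - 0.5 * d * np.log(2.0 * np.pi))


def toy_reward(z, target):
    """Hypothetical black-box reward standing in for 'denoise, then score'."""
    return -np.sum((z - target) ** 2, axis=-1)


def clipped_update(mu, mu_old, z, adv, sigma, eps=0.2, lr=0.05):
    """One gradient-ascent step on a PPO-style clipped surrogate w.r.t. mu."""
    ratio = np.exp(log_prob(z, mu, sigma) - log_prob(z, mu_old, sigma))
    # The surrogate is flat (zero gradient) once the ratio has already moved
    # past the clip range in the direction the advantage favours.
    clipped = ((adv > 0) & (ratio > 1 + eps)) | ((adv < 0) & (ratio < 1 - eps))
    weight = np.where(clipped, 0.0, ratio * adv)
    # d/dmu log N(z; mu, sigma^2 I) = (z - mu) / sigma^2
    grad = np.mean(weight[:, None] * (z - mu) / sigma ** 2, axis=0)
    return mu + lr * grad


def optimize_initial_noise(target, iters=200, batch=64, epochs=4,
                           sigma=1.0, seed=0):
    """Shift the mean of the initial noise distribution toward high reward."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(target.shape[0])
    for _ in range(iters):
        mu_old = mu.copy()
        # One-step MDP: the only action is the initial noise sample z.
        z = mu_old + sigma * rng.standard_normal((batch, target.shape[0]))
        r = toy_reward(z, target)
        # Batch standardization as a simple stand-in for reward calibration.
        adv = (r - r.mean()) / (r.std() + 1e-8)
        # Reuse the same batch for several clipped updates.
        for _ in range(epochs):
            mu = clipped_update(mu, mu_old, z, adv, sigma)
    return mu


if __name__ == "__main__":
    target = np.array([1.0, -1.0, 0.5, 2.0])
    mu = optimize_initial_noise(target)
    print("distance to high-reward region:", np.linalg.norm(mu - target))
```

In a real setting the reward would come from scoring the image or video produced by the frozen diffusion model from `z`, so no gradient through the denoising steps is needed; only the log-probability of the sampled initial noise is differentiated.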

Paper Structure

This paper contains 27 sections, 15 equations, 10 figures, 6 tables, 1 algorithm.

Figures (10)

  • Figure 1: One optimization iteration of FIND. First, we sample $\mathbf{z}_T \sim \pi_\theta$ and generate an image through a $T$-step denoising process. Next, we optimize the reward prediction network $g$ using $\mathcal{L}^*_g$. Finally, we update the initial distribution $\pi_\theta$ by policy gradient using $\mathcal{L}^*_p$.
  • Figure 2: Left: The optimization of $g$, where $N$ is the number of iterations. Right: The motivation for the dynamic reward calibration module (DRCM), where $R$ is the reward value.
  • Figure 3: Qualitative comparison of different methods. Input prompts, by column: first and second, "A green dog is running on the grass"; third and fourth, "A dog and a cat"; fifth and sixth, "Four pandas"; seventh and eighth, "A dog on the moon".
  • Figure 4: Qualitative results of the ablation study. Prompt for the left part: "A red book and a yellow vase". Prompt for the right part: "oil portrait of Batman holding a picture of Spiderman, intricate, elegant, highly detailed, lighting, painting, art station, smooth, illustration, art by Greg Rutkowski and Alphonse Mucha".
  • Figure 5: Qualitative results on video diffusion models. Prompts: top left, "A green dog is running on the grass"; top right, "A dog is running on the moon"; bottom left, "A panda is walking on the grass, from left to right"; bottom right, "A monkey is playing guitar".
  • ...and 5 more figures