Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization

Zichen Miao, Zhengyuan Yang, Kevin Lin, Ze Wang, Zicheng Liu, Lijuan Wang, Qiang Qiu

TL;DR

PSO tackles the challenge of fine-tuning timestep-distilled diffusion models without sacrificing their few-step generation ability. It introduces a pairwise sample optimization objective that increases the relative likelihood of a target image sampled from the data distribution over a reference image sampled from the current model, both conditioned on the same prompt, within an MDP-based trajectory framework. The method applies to both offline and online pairwise data and encompasses prior preference-optimization approaches as special cases. Empirically, PSO delivers competitive or superior performance across human-preference tuning, style transfer, and concept customization, while reducing the computational burden relative to full distillation or multi-step fine-tuning.

Abstract

Recent advancements in timestep-distilled diffusion models have enabled high-quality image generation that rivals non-distilled multi-step models, but with significantly fewer inference steps. While such models are attractive for applications due to their low inference cost and latency, fine-tuning them with a naive diffusion objective results in degraded and blurry outputs. An intuitive alternative is to repeat the diffusion distillation process with a fine-tuned teacher model, which produces good results but is cumbersome and computationally intensive; distillation training usually requires orders of magnitude more training compute than fine-tuning for specific image styles. In this paper, we present an algorithm named pairwise sample optimization (PSO), which enables the direct fine-tuning of an arbitrary timestep-distilled diffusion model. PSO introduces additional reference images sampled from the current timestep-distilled model, and increases the relative likelihood margin between the training images and reference images. This enables the model to retain its few-step generation ability, while allowing for fine-tuning of its output distribution. We also demonstrate that PSO is a generalized formulation which can be flexibly extended to both offline-sampled and online-sampled pairwise data, covering various popular objectives for diffusion model preference optimization. We evaluate PSO in both preference optimization and other fine-tuning tasks, including style transfer and concept customization. We show that PSO can directly adapt distilled models to human-preferred generation with both offline and online-generated pairwise preference image data. PSO also demonstrates effectiveness in style transfer and concept customization by directly tuning timestep-distilled diffusion models.
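
To make the idea of a "relative likelihood margin" concrete, the following is a minimal sketch of a pairwise logistic margin loss in the style of DPO-type objectives. It is an illustration under stated assumptions, not the paper's exact objective: the function name `pairwise_margin_loss`, the frozen-model log-likelihood arguments, and the temperature `beta` are all hypothetical, and scalar log-likelihoods stand in for the per-step trajectory terms that an actual diffusion objective would use.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def pairwise_margin_loss(
    logp_target: float,            # log-likelihood of the target (training) image, tuned model
    logp_reference: float,         # log-likelihood of the reference image, tuned model
    logp_target_frozen: float,     # same target image under a frozen copy (assumption)
    logp_reference_frozen: float,  # same reference image under the frozen copy (assumption)
    beta: float = 1.0,             # hypothetical temperature on the margin
) -> float:
    """DPO-style logistic loss on the relative likelihood margin (assumed form)."""
    margin = (logp_target - logp_target_frozen) - (
        logp_reference - logp_reference_frozen
    )
    # The loss shrinks as the target becomes relatively more likely
    # than the reference image sampled from the model itself.
    return -log_sigmoid(beta * margin)


# With no change relative to the frozen model, the margin is zero.
loss_at_start = pairwise_margin_loss(0.0, 0.0, 0.0, 0.0)
# Raising the target likelihood while lowering the reference likelihood reduces the loss.
loss_after_update = pairwise_margin_loss(1.0, -1.0, 0.0, 0.0)
```

In this sketch the loss starts at log 2 when the margin is zero and falls as the margin grows, which matches the qualitative behavior the abstract describes: the model is pushed toward the training images and away from its own current samples.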

Paper Structure (34 sections, 16 equations, 9 figures, 4 tables, 3 algorithms)

Figures (9)

  • Figure 1: Illustration of images sampled from (a) the original timestep-distilled diffusion models, (b) distilled models fine-tuned with the diffusion objective (Ho et al., 2020), and (c) distilled models fine-tuned with our PSO. Simply tuning distilled models with the vanilla diffusion loss leads to blurry, degraded generation, while our method can steer the distilled model toward better alignment with human preferences and prompts, as well as style-transferred generation. Prompts from left to right: A Pirate in a Pirateship.// a woman with long hair next to a luminescent bird.// Photograph of a wall along a city street with a watercolor mural of foxes in a jazz band.// A stern-faced, brown-feathered owl Pokémon with a leaf-shaped crown and piercing red eyes.// A cute rabbit.
  • Figure 2: Demonstration of the proposed pairwise sample optimization. To tune the generative distribution $p_\theta$ toward $p_{data}$, we sample a pair of images for the same prompt together with their trajectories. We adopt our Markov Decision Process (MDP) formulation for the timestep-distilled diffusion model to efficiently sample the backward denoising trajectory $\{x^\rho_{t_n}\}$, and sample $\{x^\tau_{t_n}\}$ from data via the forward diffusion process. The sampled trajectories are then passed to the final objective, which aligns the generation trajectory with the forward-process trajectory from $p_{data}$.
  • Figure 3: Human preference tuning results with PSO on 4-step SDXL-DMD2. Compared with the baseline SDXL-DMD2 (sub-figure (a)), SDXL-DPO with DMD2 LoRA (sub-figure (b)) exhibits slightly degraded generation quality. In contrast, SDXL-DMD2 tuned with our offline and online PSO objectives (sub-figures (c) and (d), respectively) demonstrates substantial improvement in visual appeal, prompt following, and detail generation. Prompts from left to right: The official portrait of an authoritarian president of an alternate America in 1960, in the style of pan am advertisements, looking up, jet age.// A curious cat exploring a haunted mansion.// A profile picture of an anime boy, half robot, brown hair.// On the Mid-Autumn Festival, the bright full moon hangs in the night sky. A quaint pavilion is illuminated by dim lights, resembling a beautiful scenery in a painting. Camera type: close-up. Camera lenstype: telephoto. Time of day: night. Film type: ancient style. HD.
  • Figure 4: Experiments on tuning SDXL-Turbo for style transfer with PSO. The proposed method effectively tunes the distilled model to generate images following the targeted style. Prompts from top to bottom row: A stern-faced, brown-feathered owl Pokémon with a leaf-shaped crown and piercing red eyes stands ready for a battle.// robotic cat with wings.// A cute bunny rabbit.
  • Figure 5: Experiments on tuning SDXL-Turbo for concept customization with PSO. The proposed method can effectively tune SDXL-Turbo to generate images that contain the given objects.
  • ...and 4 more figures
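
The Figure 2 caption mentions sampling the target trajectory $\{x^\tau_{t_n}\}$ from data via the forward diffusion process. A minimal sketch of that step, using the standard DDPM marginal $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, might look as follows. The linear beta schedule, the four timesteps, and the helper names `alpha_bar` and `forward_trajectory` are assumptions for illustration and are not taken from the paper; each timestep is noised independently from $x_0$, which gives the correct marginals but is not a sample of the full Markov chain.

```python
import math
import random


def alpha_bar(t: int, num_train_steps: int = 1000,
              beta_start: float = 1e-4, beta_end: float = 0.02) -> float:
    """Cumulative product of (1 - beta_s) for s = 0..t under a linear schedule (assumed)."""
    prod = 1.0
    for s in range(t + 1):
        beta = beta_start + (beta_end - beta_start) * s / (num_train_steps - 1)
        prod *= 1.0 - beta
    return prod


def forward_trajectory(x0, timesteps, rng):
    """Noise a clean sample x0 at each timestep t_n via the marginal q(x_t | x_0)."""
    trajectory = {}
    for t in timesteps:
        a = alpha_bar(t)
        trajectory[t] = [
            math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for v in x0
        ]
    return trajectory


rng = random.Random(0)
x0 = [0.5, -0.25, 1.0]            # toy stand-in for an image or latent
timesteps = [249, 499, 749, 999]  # hypothetical 4-step schedule
traj = forward_trajectory(x0, timesteps, rng)
```

A real implementation would operate on image or latent tensors and use the noise schedule and timesteps of the specific distilled model being tuned.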