Single Trajectory Distillation for Accelerating Image and Video Style Transfer

Sijie Xu, Runqi Wang, Wei Zhu, Dejia Song, Nemo Chen, Xu Tang, Yao Hu

TL;DR

Single Trajectory Distillation (STD) tackles slow diffusion-based image and video stylization by distilling a complete probability flow ODE (PF-ODE) denoising trajectory starting from a fixed partial-noise state, rather than only aligning the initial step. The method introduces a trajectory bank to reuse teacher trajectories and an asymmetric adversarial loss with DINO-v2 features to enhance style and saturation while suppressing texture noise. Empirical results on image and video stylization show that STD surpasses prior acceleration methods in style similarity and aesthetics, with ablations confirming the contributions of both STD and the asymmetric loss. The approach promises practical speedups for real-world stylization tasks and can extend to other partially noised editing tasks such as inpainting.

Abstract

Diffusion-based stylization methods typically denoise from a specific partial noise state for image-to-image and video-to-video tasks. This multi-step diffusion process is computationally expensive and hinders real-world application. A promising solution to speed up the process is to obtain few-step consistency models through trajectory distillation. However, current consistency models only enforce initial-step alignment between the probability flow ODE (PF-ODE) trajectories of the student and the imperfect teacher models. This training strategy cannot ensure the consistency of whole trajectories. To address this issue, we propose single trajectory distillation (STD) starting from a specific partial noise state. We introduce a trajectory bank to store the teacher model's trajectory states, reducing the time cost during training. In addition, we use an asymmetric adversarial loss to enhance the style and quality of the generated images. Extensive experiments on image and video stylization demonstrate that our method surpasses existing acceleration models in terms of style similarity and aesthetic evaluations. Our code and results will be available on the project page: https://single-trajectory-distillation.github.io.
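
The trajectory bank idea can be illustrated with a small toy sketch. It is not the authors' implementation: the class name `TrajectoryBank`, the `teacher_step` callable, the scalar states, and the policy of resetting a slot to its starting state once the trajectory reaches index 0 are all assumptions made for illustration. The point it shows is that each training draw costs one teacher step, because the stored state is advanced in place instead of being re-solved from the partial-noise state.

```python
import random


class TrajectoryBank:
    """Toy bank of teacher trajectory states (illustrative, not the authors' code)."""

    def __init__(self, start_states, start_index, teacher_step):
        # start_states: one partially noised state per slot (x at tau_eta).
        # start_index: discrete timestep index of tau_eta (must be >= 1).
        # teacher_step: hypothetical callable (x, i) -> state at index i - 1.
        self.start_states = list(start_states)
        self.start_index = start_index
        self.teacher_step = teacher_step
        self.slots = [(x, start_index) for x in self.start_states]
        self.teacher_calls = 0

    def sample(self, rng):
        """Return (x_t, i, x_prev) and advance the chosen slot by one teacher step."""
        k = rng.randrange(len(self.slots))
        x_t, i = self.slots[k]
        x_prev = self.teacher_step(x_t, i)  # exactly one teacher call per sample
        self.teacher_calls += 1
        if i - 1 > 0:
            self.slots[k] = (x_prev, i - 1)
        else:
            # Assumed reset policy: restart the slot at tau_eta.
            self.slots[k] = (self.start_states[k], self.start_index)
        return x_t, i, x_prev


def toy_teacher_step(x, i):
    # Stand-in for a one-step ODE solver; shrinks x linearly towards 0.
    return x * (i - 1) / i


bank = TrajectoryBank([8.0], start_index=4, teacher_step=toy_teacher_step)
rng = random.Random(0)
draws = [bank.sample(rng) for _ in range(5)]
```

In this toy setup the single slot walks down its trajectory one index per draw and then restarts, so five draws use five teacher calls in total.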

Paper Structure

This paper contains 31 sections, 25 equations, 12 figures, 3 tables, and 1 algorithm.

Figures (12)

  • Figure 1: Visualization of Results. Stylization examples from our method at 8 and 4 function evaluations (NFEs).
  • Figure 2: Comparison with other distillation schemes. (a) represents other distillation schemes where $\bm{x}_t$ is obtained by adding noise to $\bm{x}_0$, fitting only the initial portions of multiple PF-ODE trajectories. (b) represents our single-trajectory distillation scheme, where $\bm{x}_t$ is derived by denoising from $\bm{x}_{\tau_\eta}$, fitting a complete single trajectory starting from $\bm{x}_{\tau_\eta}$.
  • Figure 3: The diagram illustrates the single-trajectory distillation algorithm based on Stable Diffusion. On the left side is the trajectory bank, which manages samples from the teacher model's trajectory. Random samples $\bm{x}_t$ are drawn from this bank for training, and each sample's state is updated after every one-step sampling by the teacher model to avoid repeated sampling and minimize time consumption. In the center, we present the single-trajectory distillation for image and video consistency distillation training. Here, only the student model is trained to align with the teacher model's trajectory. On the right side is the asymmetric adversarial loss component. The adversarial loss is based on DINO-v2 and compares the student model's prediction at timestep $s$ with the noisy ground truth at timestep $r$, where $r < s$. This approach improves the style and image quality.
  • Figure 4: Comparison of Experimental Results. The figure compares our method, STD, with other acceleration methods, including LCM, TCD, PCM, TDD, Hyper-SD, and SDXL-Lightning. On the left, we also present the original image and the result obtained using a 20-step Euler solver. All methods are evaluated with CFG=6 and NFE=8.
  • Figure 5: Line chart comparing methods under different CFG values. The horizontal axis represents the style similarity metric (CSD), and the vertical axis represents the aesthetic score. The chart shows the metric lines for our method and comparison methods at CFG values of 2, 4, 6, and 8, where closer proximity to the upper-right corner indicates better performance.
  • ...and 7 more figures
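
The asymmetric timestep pairing described in the Figure 3 caption (student prediction at timestep $s$, noisy ground truth at timestep $r$ with $r < s$) can be sketched as follows. This is only a minimal illustration under stated assumptions: the uniform sampling of $s$ and $r$, the cosine-style noise schedule, and the function names `sample_asymmetric_pair` and `noisy_ground_truth` are hypothetical and are not taken from the paper. The discriminator and its DINO-v2 features are omitted.

```python
import math
import random


def sample_asymmetric_pair(num_steps, rng):
    """Draw timesteps (s, r) with 0 <= r < s <= num_steps (assumed uniform)."""
    s = rng.randrange(1, num_steps + 1)  # timestep of the student prediction
    r = rng.randrange(0, s)              # strictly smaller timestep for the real sample
    return s, r


def noisy_ground_truth(x0, r, num_steps, rng):
    """Noise a clean sample x0 to level r.

    Assumption: a variance-preserving schedule with alpha = cos, sigma = sin,
    so that r = 0 returns x0 unchanged.
    """
    angle = 0.5 * math.pi * r / num_steps
    alpha, sigma = math.cos(angle), math.sin(angle)
    return [alpha * v + sigma * rng.gauss(0.0, 1.0) for v in x0]


rng = random.Random(0)
pairs = [sample_asymmetric_pair(50, rng) for _ in range(1000)]
clean = noisy_ground_truth([0.5, -1.0], 0, 50, rng)
```

Because $r < s$, the real sample shown to the discriminator always carries less noise than the timestep at which the student predicts, which is the asymmetry the caption refers to.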