
T-Stitch: Accelerating Sampling in Pre-Trained Diffusion Models with Trajectory Stitching

Zizheng Pan, Bohan Zhuang, De-An Huang, Weili Nie, Zhiding Yu, Chaowei Xiao, Jianfei Cai, Anima Anandkumar

TL;DR

T-Stitch tackles the high cost of sampling in diffusion probabilistic models by splitting the denoising trajectory between two models: a small, cheap denoiser handles the early steps that shape the global structure, and a larger denoiser refines details in the later steps. The method is training-free, applies across architectures, and complements existing speedups, giving configurable speed-quality trade-offs (e.g., ~1.5x–1.7x speedups) without retraining. Experiments on DiT variants, U-Nets, and Stable Diffusion show consistent speedups with minimal quality degradation, and even improved prompt alignment for stylized Stable Diffusion models. This makes T-Stitch a practical, plug-and-play way to accelerate diffusion-based generation in real-world deployments.
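The stitching rule described above can be sketched as a DDIM sampling loop with a single switch point. This is a minimal, hypothetical sketch, not the code from the NVlabs/T-Stitch repository: it assumes epsilon-predicting denoisers, a linear beta schedule over 1000 training steps, and deterministic DDIM (eta = 0) over 100 steps, and it replaces real DiT models with toy "oracle" denoisers (`make_oracle`). Classifier-free guidance is omitted, and all function names are illustrative.

```python
import math
from typing import Callable, List

# Hypothetical denoiser interface: (x_t, t) -> predicted noise (epsilon-parameterised).
Denoiser = Callable[[List[float], int], List[float]]


def linear_alpha_bars(num_train_steps: int = 1000,
                      beta_start: float = 1e-4,
                      beta_end: float = 0.02) -> List[float]:
    """Cumulative products alpha_bar_t of a DDPM-style linear beta schedule."""
    alpha_bars, prod = [], 1.0
    for i in range(num_train_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_train_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def t_stitch_sample(x_T: List[float], small: Denoiser, large: Denoiser,
                    small_fraction: float, num_steps: int = 100,
                    num_train_steps: int = 1000) -> List[float]:
    """Deterministic DDIM (eta = 0): the first `small_fraction` of the sampling
    steps call `small`; all remaining steps call `large`."""
    if not 0.0 <= small_fraction <= 1.0:
        raise ValueError("small_fraction must lie in [0, 1]")
    alpha_bars = linear_alpha_bars(num_train_steps)
    stride = num_train_steps // num_steps
    timesteps = list(range(num_train_steps - stride, -1, -stride))  # 990, 980, ..., 0
    n_small = round(len(timesteps) * small_fraction)

    x = list(x_T)
    for i, t in enumerate(timesteps):
        model = small if i < n_small else large  # single switch from small to large
        eps = model(x, t)
        a_t = alpha_bars[t]
        a_prev = alpha_bars[t - stride] if t - stride >= 0 else 1.0
        # Predict x_0 from the noise estimate, then step deterministically to t - stride.
        x0 = [(xi - math.sqrt(1.0 - a_t) * ei) / math.sqrt(a_t) for xi, ei in zip(x, eps)]
        x = [math.sqrt(a_prev) * x0i + math.sqrt(1.0 - a_prev) * ei
             for x0i, ei in zip(x0, eps)]
    return x


def make_oracle(target: List[float], calls: List[int]) -> Denoiser:
    """Toy stand-in for a trained model: the exact noise prediction when the
    data distribution is a point mass at `target`. Counts its calls."""
    alpha_bars = linear_alpha_bars()

    def denoise(x: List[float], t: int) -> List[float]:
        calls[0] += 1
        a_t = alpha_bars[t]
        return [(xi - math.sqrt(a_t) * ci) / math.sqrt(1.0 - a_t)
                for xi, ci in zip(x, target)]

    return denoise


# Usage: 40% of 100 DDIM steps go to the "small" model, as in the DiT-XL/DiT-S example.
small_calls, large_calls = [0], [0]
small = make_oracle([0.9, -0.9], small_calls)
large = make_oracle([1.0, -1.0], large_calls)
out = t_stitch_sample([0.3, 0.5], small, large, small_fraction=0.4)
print(small_calls[0], large_calls[0])  # 40 60
```

Because the final step always uses the large model in this toy setup, the output lands on the large oracle's target. With real models, the small model's early steps instead shape the global structure that the large model then refines; the single `small_fraction` knob is what gives the configurable speed-quality trade-off.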

Abstract

Sampling from diffusion probabilistic models (DPMs) is often expensive for high-quality image generation and typically requires many steps with a large model. In this paper, we introduce sampling Trajectory Stitching (T-Stitch), a simple yet efficient technique to improve sampling efficiency with little or no generation degradation. Instead of using only a large DPM for the entire sampling trajectory, T-Stitch first leverages a smaller DPM in the initial steps as a cheap drop-in replacement for the larger DPM and switches to the larger DPM at a later stage. Our key insight is that different diffusion models learn similar encodings under the same training data distribution, and smaller models are capable of generating good global structures in the early steps. Extensive experiments demonstrate that T-Stitch is training-free, generally applicable to different architectures, and complements most existing fast sampling techniques with flexible speed-quality trade-offs. On DiT-XL, for example, 40% of the early timesteps can be safely replaced with a 10x faster DiT-S without a performance drop on class-conditional ImageNet generation. We further show that our method can be used as a drop-in technique not only to accelerate popular pretrained Stable Diffusion (SD) models but also to improve the prompt alignment of stylized SD models from the public model zoo. Code is released at https://github.com/NVlabs/T-Stitch
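The DiT-XL/DiT-S example implies a rough speedup figure. As an idealized back-of-the-envelope estimate, assume the per-step model evaluation is the only cost (ignoring overhead such as latent decoding), and write $N$ for the number of sampling steps, $r$ for the fraction of steps given to the small model, and $c_S$, $c_L$ for the per-step costs of the small and large models (these symbols are introduced here for illustration):

```latex
\text{speedup}(r) = \frac{N\,c_L}{(1 - r)\,N\,c_L + r\,N\,c_S}
                  = \frac{1}{(1 - r) + r\,c_S / c_L},
\qquad
\text{speedup}(0.4)\Big|_{c_S / c_L = 0.1} = \frac{1}{0.6 + 0.04} = \frac{1}{0.64} \approx 1.56 .
```

This is in the same range as the ~1.5x figure quoted in the TL;DR; measured wall-clock speedups additionally depend on hardware, batch size, and fixed per-sample overheads.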

Paper Structure (34 sections, 5 equations, 33 figures, 11 tables)

Figures (33)

  • Figure 1: Top: FID comparison on class-conditional ImageNet when progressively stitching more DiT-S steps at the beginning and fewer DiT-XL steps at the end, based on 100 DDIM timesteps and a classifier-free guidance scale of 1.5. FID is computed from 5000 sampled images. Bottom: One example of stitching more DiT-S steps to achieve faster sampling, where the time cost, in seconds (s), is measured by generating 8 images on one RTX 3090.
  • Figure 2: By directly adopting a small SD from the model zoo, T-Stitch naturally interpolates speed, style, and image content with a large stylized SD, and can also improve prompt alignment, e.g., "New York City" and "tropical beach" in the above examples.
  • Figure 3: Similarity comparison of latent embeddings at different denoising steps between different DiT models. Results are averaged over 32 images.
  • Figure 4: Trajectory Stitching (T-Stitch): Based on pretrained small and large DPMs, we can use the more efficient small DPM for different percentages of the early denoising steps to achieve different speed-quality trade-offs.
  • Figure 5: T-Stitch with three model combinations: DiT-XL/S, DiT-XL/B, and DiT-B/S. We adopt 100 DDIM timesteps with a classifier-free guidance scale of 1.5.
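Figure 3 compares latent embeddings from different DiT models at matching denoising steps, averaged over 32 images; this is the evidence behind the claim that different DPMs learn similar encodings. The caption does not name the similarity measure, so the sketch below assumes cosine similarity between flattened latents, with both models started from the same initial noise. The function `per_step_similarity` and its input layout are hypothetical, and collecting latents from real DiT-S/DiT-XL models is left out.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two flattened latents."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def per_step_similarity(latents_a: List[List[Vector]],
                        latents_b: List[List[Vector]]) -> List[float]:
    """latents_x[i][k] is the flattened latent of image i at denoising step k
    from model x. Returns the per-step similarity averaged over images."""
    if len(latents_a) != len(latents_b) or not latents_a:
        raise ValueError("need the same non-zero number of images from both models")
    num_steps = len(latents_a[0])
    return [
        sum(cosine_similarity(a[k], b[k]) for a, b in zip(latents_a, latents_b))
        / len(latents_a)
        for k in range(num_steps)
    ]


# Usage with two toy "images" and two steps per model.
model_a = [[[1.0, 0.0], [1.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
model_b = [[[1.0, 0.0], [1.0, -1.0]], [[0.0, 2.0], [0.0, 1.0]]]
sims = per_step_similarity(model_a, model_b)
print(sims)  # [1.0, 0.0]
```

Plotting such a curve against the step index would show where two models' trajectories agree most, which is the property T-Stitch exploits when it hands the early steps to the smaller model.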