
HybridStitch: Pixel and Timestep Level Model Stitching for Diffusion Acceleration

Desen Sun, Jason Hon, Jintao Zhang, Sihang Liu

TL;DR

This work proposes HybridStitch, a new T2I generation paradigm that treats generation like editing. It achieves a 1.83$\times$ speedup on Stable Diffusion 3, faster than all existing mixture-of-models methods.

Abstract

Diffusion models have demonstrated a remarkable ability in Text-to-Image (T2I) generation. Despite their output quality, they suffer from heavy computation overhead, especially large models that contain tens of billions of parameters. Prior work has shown that replacing part of the denoising steps with a smaller model still maintains generation quality. However, these methods only save computation at some timesteps and ignore how compute demand varies within a single timestep. In this work, we propose HybridStitch, a new T2I generation paradigm that treats generation like editing. Specifically, we introduce a hybrid stage that uses the large model and the small model jointly. HybridStitch separates the image into two regions: one that is relatively easy to render, enabling an early transition to the smaller model, and another that is more complex and therefore requires refinement by the large model. HybridStitch employs the small model to construct a coarse sketch while using the large model to edit and refine the complex regions. According to our evaluation, HybridStitch achieves a 1.83$\times$ speedup on Stable Diffusion 3, faster than all existing mixture-of-models methods.
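The region split described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes the "complex" region is chosen as the fraction of positions where the two models' predictions differ most, and the names `difference_mask`, `stitch`, and `keep_fraction` are hypothetical.

```python
import numpy as np

def difference_mask(pred_large, pred_small, keep_fraction=0.4):
    """Mark the `keep_fraction` of positions with the largest |large - small|.

    Assumption: the complex region is where the two predictions disagree most.
    The selection rule used by HybridStitch itself is not given in this section.
    """
    diff = np.abs(pred_large - pred_small)
    k = int(round(keep_fraction * diff.size))
    flat_mask = np.zeros(diff.size, dtype=bool)
    if k > 0:
        # Indices of the k largest differences (order among them is irrelevant).
        top_idx = np.argpartition(diff.ravel(), -k)[-k:]
        flat_mask[top_idx] = True
    return flat_mask.reshape(diff.shape)

def stitch(pred_large, pred_small, mask):
    """Use the large model's prediction inside the mask, the small model's elsewhere."""
    return np.where(mask, pred_large, pred_small)

# Toy arrays standing in for the two models' outputs at one denoising step.
rng = np.random.default_rng(0)
pred_large = rng.normal(size=(8, 8))
pred_small = rng.normal(size=(8, 8))
mask = difference_mask(pred_large, pred_small, keep_fraction=0.25)
stitched = stitch(pred_large, pred_small, mask)
```

In this toy version both predictions are computed everywhere and only then combined, so it shows the data flow but not the savings; the speedup would come from running the large model on the masked region only.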

Paper Structure

This paper contains 32 sections, 7 equations, 8 figures, 3 tables, and 1 algorithm.

Figures (8)

  • Figure 1: (a) Naively switching models at whole-image granularity [cheng2025srdiffusionacceleratevideodiffusion], which achieves a 1.55$\times$ speedup over the large model. (b) The major differences between the predictions of the large and small models. Left: the 40% difference of the output; right: the output image. (c) Region-aware model switching. Some pixels switch to the next model while the other pixels (marked as Mask) continue with the previous model. It achieves a 1.83$\times$ speedup over the large model.
  • Figure 2: Distribution of absolute difference values between the large and small models across steps 10, 30, and 50.
  • Figure 3: The mask update method at each denoising step. $X_T$ denotes the latent at timestep $T$. $M_{T-m}$ is the masked input to the large model at step $T-m$. $X'_{T-m}$ is the temporary latent generated by the small model. Masked Generation takes as input the latent from the previous denoising step and the mask.
  • Figure 4: Qualitative results. FID is computed against the ground-truth images.
  • Figure 5: Speedup on other GPU types: H100 and A100.
  • ...and 3 more figures
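The Figure 3 caption describes one step of the hybrid stage: the small model produces a temporary latent $X'_{T-m}$, and Masked Generation refines the masked region using the previous latent and the mask. Here is a minimal sketch of that data flow under stated assumptions. The function `hybrid_step` and the two toy model callables are hypothetical stand-ins, not the paper's API, and how the mask itself is updated between steps is not modeled here.

```python
import numpy as np

def hybrid_step(x_prev, mask, small_model, large_model, t):
    """One hybrid denoising step (illustrative only).

    x_prev      : latent from the previous denoising step
    mask        : boolean array, True where the large model should refine
    small_model : callable (latent, t) -> latent, assumed cheap
    large_model : callable (latent, mask, t) -> latent, assumed to do masked generation
    """
    x_tmp = small_model(x_prev, t)            # temporary latent X' from the small model
    x_refined = large_model(x_prev, mask, t)  # masked generation from previous latent + mask
    # Keep the large model's result inside the mask, the small model's elsewhere.
    return np.where(mask, x_refined, x_tmp)

# Toy stand-ins so the sketch runs; real models would be diffusion networks.
toy_small = lambda x, t: x * 0.5
toy_large = lambda x, m, t: x * 0.9

x_prev = np.arange(16, dtype=float).reshape(4, 4)
step_mask = np.zeros((4, 4), dtype=bool)
step_mask[:2, :] = True                       # top half treated as the complex region
x_next = hybrid_step(x_prev, step_mask, toy_small, toy_large, t=10)
```

A real implementation would presumably restrict the large model's computation to the masked tokens rather than selecting from a full-resolution output as this toy does.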