SRDiffusion: Accelerate Video Diffusion Inference via Sketching-Rendering Cooperation

Shenggan Cheng, Yuanxin Wei, Lansong Diao, Yong Liu, Bujiao Chen, Lianghua Huang, Yu Liu, Wenyuan Yu, Jiangsu Du, Wei Lin, Yang You

TL;DR

SRDiffusion tackles the high computational cost of diffusion-based video generation by introducing Sketching-Rendering Cooperation: a large model operates in the high-noise early steps to ensure semantic structure and motion fidelity, while a smaller model rapidly refines details in the low-noise later steps. An adaptive switching mechanism selects when to transition from sketching to rendering based on the dynamics of the denoising process. The approach is claimed to be orthogonal to existing acceleration methods, achieving multi-fold speedups (e.g., >3× on Wan with negligible quality loss on VBench and ~2× on CogVideoX) and synergizing with caching and hardware-specific optimizations (TeaCache, SageAttention). The results suggest practical, scalable video generation improvements, with limitations noted in cross-model latent alignment and future work focusing on broader model interoperability.
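The cooperation described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names (`sketch_render`, `relative_change`), the plain-list latent, the fixed `step_size` update, and in particular the switching rule (switch once the relative change between consecutive noise predictions drops below `threshold`). The summary only says that the switch point is chosen adaptively from the dynamics of the denoising process, so this should not be read as the paper's actual criterion or scheduler.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]
# A denoiser maps (latent, step index) to a predicted-noise vector.
Denoiser = Callable[[Vector, int], Vector]


def relative_change(prev: Sequence[float], curr: Sequence[float]) -> float:
    """Relative L1 difference between two consecutive noise predictions."""
    num = sum(abs(c - p) for p, c in zip(prev, curr))
    den = sum(abs(p) for p in prev) or 1e-12  # avoid division by zero
    return num / den


def sketch_render(
    x: Vector,
    large: Denoiser,
    small: Denoiser,
    num_steps: int,
    threshold: float,
    step_size: float = 0.1,
) -> Tuple[Vector, Optional[int]]:
    """Run `large` first (sketching), then hand over to `small` (rendering).

    Returns the final latent and the index of the first step run by `small`
    (None if the switch never happened). The update rule is a toy stand-in
    for a real sampler.
    """
    use_large = True
    prev_eps: Optional[Vector] = None
    switch_step: Optional[int] = None
    for t in range(num_steps):
        model = large if use_large else small
        eps = model(x, t)
        x = [xi - step_size * ei for xi, ei in zip(x, eps)]
        # Hypothetical adaptive rule: once the large model's prediction
        # stabilizes, the remaining steps go to the small model.
        if use_large and prev_eps is not None:
            if relative_change(prev_eps, eps) < threshold:
                use_large = False
                switch_step = t + 1
        prev_eps = eps
    return x, switch_step
```

In a real pipeline both denoisers would have to operate on the same latent space, which is the cross-model alignment limitation the summary mentions.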

Abstract

Leveraging the diffusion transformer (DiT) architecture, models like Sora, CogVideoX and Wan have achieved remarkable progress in text-to-video, image-to-video, and video editing tasks. Despite these advances, diffusion-based video generation remains computationally intensive, especially for high-resolution, long-duration videos. Prior work accelerates inference by skipping computation, usually at the cost of severe quality degradation. In this paper, we propose SRDiffusion, a novel framework that leverages collaboration between large and small models to reduce inference cost. The large model handles high-noise steps to ensure semantic and motion fidelity (Sketching), while the smaller model refines visual details in low-noise steps (Rendering). Experimental results demonstrate that our method outperforms existing approaches, achieving over 3$\times$ speedup for Wan with nearly no quality loss on VBench, and 2$\times$ speedup for CogVideoX. Our method is introduced as a new direction orthogonal to existing acceleration strategies, offering a practical solution for scalable video generation.
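To see how handing late steps to a smaller model can translate into a multi-fold speedup, a back-of-the-envelope estimate may help. The function below is a hypothetical helper, not something from the paper: it assumes a fixed per-step cost for each model, expresses the small model's cost as a fraction `cost_ratio` of the large model's, and ignores text encoding, VAE decoding, and any switching overhead. The numbers in the usage lines are illustrative and are not measurements from the paper.

```python
def estimated_speedup(num_steps: int, switch_step: int, cost_ratio: float) -> float:
    """Estimate denoising speedup when the small model takes over at `switch_step`.

    num_steps   -- total number of denoising steps
    switch_step -- number of steps run by the large model
    cost_ratio  -- per-step cost of the small model divided by that of the
                   large model (assumed constant, must be > 0)
    """
    if not 0 <= switch_step <= num_steps:
        raise ValueError("switch_step must lie in [0, num_steps]")
    if cost_ratio <= 0:
        raise ValueError("cost_ratio must be positive")
    baseline = float(num_steps)  # every step on the large model
    hybrid = switch_step + (num_steps - switch_step) * cost_ratio
    return baseline / hybrid


# Illustrative only: 50 steps, large model for the first 10,
# small model assumed to cost one tenth as much per step.
print(round(estimated_speedup(50, 10, 0.1), 2))  # 3.57
```

Under this simple model the speedup is bounded by `1 / cost_ratio`, and it approaches that bound only when the switch happens very early.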

Paper Structure

This paper contains 20 sections, 4 equations, 9 figures, 7 tables, 1 algorithm.

Figures (9)

  • Figure 1: Impact of perturbations at various diffusion steps on the quality of video frames.
  • Figure 2: Overview of Sketching-Rendering Cooperation. Taking the Wan pipeline as an example, the figure illustrates how the pipeline switches from Wan 14B to Wan 1.3B at timestep $t$.
  • Figure 3: Predicted noise difference across denoising steps in Wan-14B 480p and CogVideoX-5B 480p. Each color corresponds to a different prompt.
  • Figure 4: Visualization Results. We compare the generation quality of the original model, our method, and baselines. (SDR: SRDiffusion, TC: TeaCache)
  • Figure 5: SRDiffusion combined with TeaCache and SageAttention achieves over 6× speedup on a single NVIDIA H20. (SDR: SRDiffusion, TC: TeaCache, SA: SageAttention)
  • ...and 4 more figures