TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times

Jintao Zhang, Kaiwen Zheng, Kai Jiang, Haoxu Wang, Ion Stoica, Joseph E. Gonzalez, Jianfei Chen, Jun Zhu

TL;DR

TurboDiffusion targets the high inference latency of diffusion-based video generation by integrating four acceleration techniques: low-bit SageAttention and Sparse-Linear Attention (SLA) for attention, rCM for step distillation, and W8A8 quantization to speed up linear layers and compress the model, alongside additional engineering optimizations. On Wan2.2-I2V-A14B-720P and the Wan2.1-T2V models, it delivers 100–200× end-to-end speedups on a single RTX 5090 while maintaining comparable video quality, and the authors release a GitHub repository with model checkpoints and easy-to-use code. The approach narrows the gap between diffusion-based video generation and real-time use. Future work aims to extend the framework to autoregressive video diffusion and other paradigms.

Abstract

We introduce TurboDiffusion, a video generation acceleration framework that can speed up end-to-end diffusion generation by 100-200x while maintaining video quality. TurboDiffusion mainly relies on several components for acceleration: (1) Attention acceleration: TurboDiffusion uses low-bit SageAttention and trainable Sparse-Linear Attention (SLA) to speed up attention computation. (2) Step distillation: TurboDiffusion adopts rCM for efficient step distillation. (3) W8A8 quantization: TurboDiffusion quantizes model parameters and activations to 8 bits to accelerate linear layers and compress the model. In addition, TurboDiffusion incorporates several other engineering optimizations. We conduct experiments on the Wan2.2-I2V-14B-720P, Wan2.1-T2V-1.3B-480P, Wan2.1-T2V-14B-720P, and Wan2.1-T2V-14B-480P models. Experimental results show that TurboDiffusion achieves 100-200x speedup for video generation even on a single RTX 5090 GPU, while maintaining comparable video quality. The GitHub repository, which includes model checkpoints and easy-to-use code, is available at https://github.com/thu-ml/TurboDiffusion.
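
To make the W8A8 idea in item (3) concrete, the following is a minimal sketch of a linear layer whose weights and activations are both quantized to 8-bit integers before the matrix product. It is an illustration only, not the TurboDiffusion implementation: the abstract does not state the quantization granularity or rounding scheme, so the symmetric round-to-nearest scheme, the per-tensor activation scale, the per-output-row weight scale, and the helper names `quantize_int8`, `w8a8_linear`, and `fp_linear` are all assumptions. It uses plain Python lists so it runs without any dependencies.

```python
"""Toy W8A8 linear layer: INT8 weights and INT8 activations.

Assumptions (not taken from the paper): symmetric quantization,
round-to-nearest, integer range [-127, 127], one scale per activation
vector and one scale per weight row.
"""


def quantize_int8(values):
    """Quantize floats to integers in [-127, 127].

    Returns (q, scale) such that values[i] is approximately q[i] * scale.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def w8a8_linear(x, weight):
    """Approximate y = weight @ x using 8-bit operands.

    x: list of floats, length in_features.
    weight: list of rows (out_features x in_features).
    """
    qx, sx = quantize_int8(x)  # activations: 8-bit
    y = []
    for row in weight:
        qw, sw = quantize_int8(row)  # weights: 8-bit, one scale per row
        acc = sum(a * b for a, b in zip(qw, qx))  # integer accumulation
        y.append(acc * sw * sx)  # rescale back to floating point
    return y


def fp_linear(x, weight):
    """Full-precision reference for comparison."""
    return [sum(w * a for w, a in zip(row, x)) for row in weight]


x = [0.5, -1.25, 2.0, 0.1]
W = [[0.2, -0.4, 0.1, 0.9], [1.5, 0.3, -0.7, 0.05]]
y_ref = fp_linear(x, W)
y_q = w8a8_linear(x, W)
max_err = max(abs(a - b) for a, b in zip(y_ref, y_q))
print(y_ref, y_q, max_err)
```

The integer product is where a speedup would come from on hardware with fast 8-bit matrix units, and storing `qw` instead of the original weights is what shrinks the model; the two scale factors recover an approximation of the floating-point result.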

Paper Structure

This paper contains 12 sections and 29 figures.

Figures (29)

  • Figure 1: An example of a 5-second video generated by Wan2.1-T2V-1.3B-480P on a single RTX 5090.
  • Figure 2: An example of a 5-second video generation on Wan2.2-I2V-A14B-720P using a single RTX 5090.
  • Figure 3: Speedup of TurboDiffusion on various video generation models on a single RTX 5090. For Wan2.2-I2V-A14B-720P, the latency includes the switching overhead between the high-noise and low-noise models, resulting in a lower measured speedup compared to Wan2.1-T2V-14B-720P. In theory, the achievable speedup is identical.
  • Figure 4: By algorithm and system co-optimization, TurboDiffusion reduces the diffusion inference latency of Wan2.1-T2V-14B-720P by around 200$\times$ on a single RTX 5090.
  • Figure 5: 5-second video generation on Wan2.2-I2V-A14B-720P using a single RTX 5090. Image prompt is the first frame and the text prompt is "POV selfie video, ultra-messy and extremely fast. A white cat in sunglasses stands on a surfboard with a neutral look when the board suddenly whips sideways, throwing cat and camera into the water; the frame dives sharply downward, swallowed by violent bursts of bubbles, spinning turbulence, and smeared water streaks as the camera sinks. Shadows thicken, pressure ripples distort the edges, and loose bubbles rush upward past the lens, showing the camera is still sinking. Then the cat kicks upward with explosive speed, dragging the view through churning bubbles and rapidly brightening water as sunlight floods back in; the camera races upward, water streaming off the lens, and finally breaks the surface in a sudden blast of light and spray, snapping back into a crooked, frantic selfie as the cat resurfaces."
  • ...and 24 more figures
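
The Figure 3 caption notes that a switching overhead included in the latency lowers the measured speedup even when the achievable speedup is the same. A small sketch of that arithmetic follows. The function name `measured_speedup` and all latency values are hypothetical and chosen only for illustration (they are not measurements from the paper), and the sketch assumes the overhead is a fixed cost that acceleration does not reduce and that is counted in both the baseline and the accelerated latency.

```python
def measured_speedup(baseline_s, accelerated_s, overhead_s=0.0):
    """Ratio of end-to-end latencies when both include a fixed overhead.

    All arguments are latencies in seconds. The overhead is assumed to be
    unaffected by acceleration (an assumption of this sketch).
    """
    if baseline_s <= 0 or accelerated_s <= 0 or overhead_s < 0:
        raise ValueError("latencies must be positive and overhead non-negative")
    return (baseline_s + overhead_s) / (accelerated_s + overhead_s)


# Hypothetical numbers, for illustration only.
no_overhead = measured_speedup(1000.0, 5.0)
with_overhead = measured_speedup(1000.0, 5.0, overhead_s=5.0)
print(no_overhead, with_overhead)
```

With these made-up inputs, the same underlying 200-fold reduction in compute time shows up as roughly half that once a 5-second fixed cost is counted on both sides, which is the effect the caption describes.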