FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion

Akide Liu, Zeyu Zhang, Zhexin Li, Xuehai Bai, Yizeng Han, Jiasheng Tang, Yuanjie Xing, Jichao Wu, Mingyang Yang, Weihua Chen, Jiahao He, Yuanyu He, Fan Wang, Gholamreza Haffari, Bohan Zhuang

TL;DR

Diffusion-based video generation faces high computational demands, especially for attention. FPSAttention proposes a training-aware co-design that jointly optimizes FP8 tile-wise quantization and structured 3D sparsity for video diffusion transformers, with a denoising-step-aware schedule and hardware-optimized kernels. On Wan2.1 1.3B and 14B models evaluated with VBench, it achieves up to 7.09x kernel speedup and 4.96x end-to-end speedup at 720p while maintaining generation quality. This work demonstrates robust performance of structured, tile-aligned compression in diffusion models and highlights a practical pathway for deploying fast, high-quality video generation on FP8-friendly hardware.

Abstract

Diffusion generative models have become the standard for producing high-quality, coherent video content, yet their slow inference speeds and high computational demands hinder practical deployment. Although both quantization and sparsity can independently accelerate inference while maintaining generation quality, naively combining these techniques in existing training-free approaches leads to significant performance degradation due to the lack of joint optimization. We introduce FPSAttention, a novel training-aware co-design of FP8 quantization and sparsity for video generation, with a focus on the 3D bi-directional attention mechanism. Our approach features three key innovations: 1) A unified 3D tile-wise granularity that simultaneously supports both quantization and sparsity; 2) A denoising step-aware strategy that adapts to the noise schedule, addressing the strong correlation between quantization/sparsity errors and denoising steps; 3) A native, hardware-friendly kernel that leverages FlashAttention and is implemented with optimized Hopper architecture features for highly efficient execution. Trained on Wan2.1's 1.3B and 14B models and evaluated on the VBench benchmark, FPSAttention achieves a 7.09x kernel speedup for attention operations and a 4.96x end-to-end speedup for video generation compared to the BF16 baseline at 720p resolution, without sacrificing generation quality.
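
To make the idea of a 3D tile-wise quantization granularity concrete, here is a minimal NumPy sketch, not the paper's kernel. It assumes tokens laid out on a (frames, height, width) grid with one absmax scale per 3D tile, and it emulates FP8 E4M3 rounding in software. The helper names (`fake_fp8_e4m3`, `quantize_tilewise`), the tile shape, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value of the FP8 E4M3 format


def fake_fp8_e4m3(x):
    """Approximate FP8 E4M3 rounding (3 mantissa bits, min normal 2**-6).

    Software emulation for illustration only; real kernels use hardware FP8.
    """
    x = np.clip(x, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    _, exp = np.frexp(x)            # x = m * 2**exp with 0.5 <= |m| < 1
    exp = np.maximum(exp, -5)       # below 2**-6 the grid is uniform (subnormals)
    step = np.ldexp(1.0, exp - 4)   # spacing between neighbouring FP8 values
    return np.round(x / step) * step


def quantize_tilewise(x, tile):
    """Quantize-dequantize x of shape (T, H, W, D) with one scale per 3D tile."""
    T, H, W, D = x.shape
    tt, th, tw = tile
    assert T % tt == 0 and H % th == 0 and W % tw == 0, "tile must divide grid"
    blocks = x.reshape(T // tt, tt, H // th, th, W // tw, tw, D)
    # One absmax per tile, shared by every token and channel inside the tile.
    absmax = np.abs(blocks).max(axis=(1, 3, 5, 6), keepdims=True)
    scale = np.maximum(absmax, 1e-12) / FP8_E4M3_MAX
    dequant = fake_fp8_e4m3(blocks / scale) * scale
    return dequant.reshape(x.shape), scale.reshape(T // tt, H // th, W // tw)


# Synthetic data (assumption): one tile holds large-magnitude outliers.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8, 16))
x[:2, :4, :4] *= 1e5
background = np.ones(x.shape[:3], dtype=bool)
background[:2, :4, :4] = False


def background_error(x_hat):
    """Relative error measured only on tokens outside the outlier tile."""
    return np.abs(x_hat - x)[background].mean() / np.abs(x)[background].mean()


x_tile, scales = quantize_tilewise(x, tile=(2, 4, 4))     # per-3D-tile scales
x_tensor, _ = quantize_tilewise(x, tile=x.shape[:3])      # single global scale
err_tile = background_error(x_tile)
err_tensor = background_error(x_tensor)
print(f"per-tile error: {err_tile:.4f}, per-tensor error: {err_tensor:.4f}")
```

With a single global scale, the outlier tile pushes the remaining tokens toward the bottom of the FP8 range, so their relative error grows. Per-tile scales confine that effect to the tile that contains the outliers. Because sparse attention can skip whole tiles of the same shape, one tiling can serve both purposes, which is the property the abstract highlights.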

Paper Structure

This paper contains 22 sections, 6 equations, 7 figures, 20 tables, 1 algorithm.

Figures (7)

  • Figure 1: Comparing previous training-free quantization (a) and training-free sparsity (b) approaches reveals substantial accuracy degradation and a lack of compatibility when used independently. In contrast, our FPSAttention framework (c) integrates low-precision and sparse patterns in a single training process, yielding near-zero accuracy loss and seamless deployment.
  • Figure 2: Comparison of video generation results. Top: Training-Free FP8 + STA. Bottom: Our Training-Aware FPSAttention.
  • Figure 3: Overview of FPSAttention. (1) Our approach synergistically optimizes joint quantization and sparsity patterns within the attention mechanism for efficient video generation. (2) We introduce a novel denoising step-aware strategy that dynamically adapts the granularity throughout the diffusion process, balancing computational efficiency and perceptual fidelity. Empirical observations are shown in the figure on the adaptive schedule. (3) A fused hardware-friendly kernel is applied for attention operations.
  • Figure 4: Quantization granularities: per-token, per-channel, per-group, and our per 3D-tile, which aligns with hardware compute patterns.
  • Figure 5: Joint quantization and sparsity error patterns across denoising steps. Blue: token-level granularity; orange: our 3D tile-wise granularity with sparse attention. Key insight: early/late steps tolerate coarser quantization and higher sparsity, while intermediate steps require finer granularity and denser attention. Our FPSAttention (green) closely approximates highest-granularity methods, validating our adaptive scheduling strategy. All measurements are taken at inference time with identical prompts. Performance gaps primarily stem from FP8 quantization rather than sparsity constraints.
  • ...and 2 more figures
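
The key insight in the Figure 5 caption (coarser quantization and higher sparsity in early and late denoising steps, finer granularity and denser attention in between) can be sketched as a simple step-dependent lookup. This is a hypothetical illustration only: the `StepConfig` fields, the tile and window sizes, and the 20% boundaries are assumptions, not values from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepConfig:
    tile: tuple    # 3D quantization tile (frames, height, width); assumed values
    window: tuple  # attention window in tiles; smaller means sparser; assumed values


# Hypothetical settings: coarse tiles with a small window vs. fine tiles
# with a larger (denser) window.
COARSE = StepConfig(tile=(4, 8, 8), window=(3, 3, 3))
FINE = StepConfig(tile=(2, 4, 4), window=(5, 5, 5))


def config_for_step(step, num_steps, early_frac=0.2, late_frac=0.2):
    """Pick a configuration from the position within the denoising trajectory.

    early_frac and late_frac are illustrative thresholds, not tuned values.
    """
    if not 0 <= step < num_steps:
        raise ValueError("step must satisfy 0 <= step < num_steps")
    pos = step / num_steps
    if pos < early_frac or pos >= 1.0 - late_frac:
        return COARSE   # early and late steps tolerate more compression
    return FINE         # intermediate steps need finer granularity


for s in (0, 25, 49):
    print(s, config_for_step(s, num_steps=50))
```

A table lookup like this keeps the schedule fixed ahead of time, so each denoising step can use a kernel configuration chosen before inference starts. How the boundaries are chosen in practice is not stated in the text above.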