Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation

Hanshuai Cui, Zhiqing Tang, Zhi Yao, Fanshuai Meng, Weijia Jia, Wei Zhao

Abstract

Autoregressive (AR) video diffusion models enable long-form video generation but remain expensive due to repeated multi-step denoising. Existing training-free acceleration methods rely on binary cache-or-recompute decisions, overlooking intermediate cases where direct reuse is too coarse yet full recomputation is unnecessary. Moreover, asynchronous AR schedules assign different noise levels to co-generated frames, yet existing methods process the entire valid interval uniformly. To address these AR-specific inefficiencies, we present SCOPE, a training-free framework for efficient AR video diffusion. SCOPE introduces a tri-modal scheduler over cache, predict, and recompute, where prediction via noise-level Taylor extrapolation fills the gap between reuse and recomputation with explicit stability controls backed by error propagation analysis. It further introduces selective computation that restricts execution to the active frame interval. On MAGI-1 and SkyReels-V2, SCOPE achieves up to 4.73x speedup while maintaining quality comparable to the original output, outperforming all training-free baselines.
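The abstract's tri-modal scheduler and Taylor-based prediction can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: `taylor_extrapolate` shows a first-order finite-difference Taylor extrapolation of a cached feature across denoising steps, and `schedule_mode` shows one hypothetical way an error estimate could be routed to cache, predict, or recompute. The function names, the threshold scheme, and the scalar `delta` step are illustrative; the paper's scheduler additionally applies stability controls derived from its error propagation analysis, which are not modeled here.

```python
import numpy as np

def taylor_extrapolate(feat_prev, feat_curr, delta=1.0):
    """First-order Taylor extrapolation of a cached feature map.

    Approximates the feature at the next denoising step from two cached
    snapshots, using a finite difference as the derivative estimate with
    respect to the (noise-level) step index. Illustrative only.
    """
    derivative = feat_curr - feat_prev      # finite-difference slope
    return feat_curr + delta * derivative   # linear extrapolation

def schedule_mode(err, reuse_thresh, predict_thresh):
    """Hypothetical tri-modal decision from an estimated feature error."""
    if err < reuse_thresh:
        return "cache"      # features stable enough to reuse directly
    if err < predict_thresh:
        return "predict"    # drift is smooth: Taylor-extrapolate instead
    return "recompute"      # drift too large: run the full denoiser step

# Toy usage: features moving linearly are extrapolated exactly.
pred = taylor_extrapolate(np.array([1.0, 2.0]), np.array([2.0, 3.0]))
print(pred)  # [3. 4.]
```

The point of the middle branch is exactly the abstract's "intermediate case": when features drift smoothly, a cheap linear extrapolation can replace a full denoiser call without the error of raw reuse.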


Paper Structure

This paper contains 21 sections, 12 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: SCOPE achieves large speedup while preserving near-original quality on both MAGI-1 and SkyReels-V2.
  • Figure 2: At Frame 13 over steps 54--59: (a) Taylor prediction tracks the ground-truth feature more closely than reuse-only; (b) Taylor prediction also yields lower feature error.
  • Figure 3: Conceptual step-matrix view of asynchronous autoregressive denoising. At each iteration, only part of the valid frame interval is active; dark cells mark frames whose scheduling state is still advancing.
  • Figure 4: Overview of SCOPE for accelerating autoregressive video diffusion by reducing redundant computation in both the spatial and temporal dimensions through selective computation and predictive extrapolation.
  • Figure 5: Qualitative comparison of all methods on representative prompts from the two models. The left two groups show results on SkyReels-V2, while the right group shows results on MAGI-1. Each method row is annotated with its end-to-end speedup on SkyReels-V2 and MAGI-1, respectively. Compared with other accelerated baselines, SCOPE preserves object structure and visual fidelity more consistently while maintaining the strongest overall acceleration-quality tradeoff.
  • ...and 7 more figures