TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction

Yukuo Ma, Cong Liu, Junke Wang, Junqi Liu, Haibin Huang, Zuxuan Wu, Chi Zhang, Xuelong Li

TL;DR

TempoMaster addresses the challenge of efficiently producing long, temporally coherent videos. It introduces next-frame-rate prediction: the model first creates a low-frame-rate global blueprint, then progressively refines it at higher frame rates with a Multi-Mask Diffusion Transformer, which allows temporal segments to be refined in parallel. Key contributions include a two-stage training regime on multi-frame-rate data, a unified conditioning framework for diverse inputs, and a parallel inference strategy with theoretical speedups. Results on VBench and in human studies show state-of-the-art performance for long-video generation, and ablations support the design choices. The approach targets efficient, coherent long-video generation for applications such as film, storytelling, and interactive media.

Abstract

We present TempoMaster, a novel framework that formulates long video generation as next-frame-rate prediction. Specifically, we first generate a low-frame-rate clip that serves as a coarse blueprint of the entire video sequence, and then progressively increase the frame rate to refine visual details and motion continuity. During generation, TempoMaster employs bidirectional attention within each frame-rate level while performing autoregression across frame rates, thus achieving long-range temporal coherence while enabling efficient and parallel synthesis. Extensive experiments demonstrate that TempoMaster establishes a new state-of-the-art in long video generation, excelling in both visual and temporal quality.
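The coarse-to-fine procedure described in the abstract can be illustrated with a minimal sketch of the frame schedule. This is not the authors' implementation: the function name `frame_rate_schedule`, the doubling of the frame rate between levels (`factor=2`), and the reuse of all previously generated frames as conditions at the next level are assumptions made for illustration only.

```python
def frame_rate_schedule(num_frames, num_levels, factor=2):
    """Hypothetical helper: list, per frame-rate level, which frame indices
    are already available (condition) and which are newly generated.

    Assumptions (not stated in the source): the frame rate grows by `factor`
    at each level, and every frame from the previous level is kept as a
    condition for the next one.
    """
    schedule = []
    known = []  # frame indices produced by earlier (lower-rate) levels
    for level in range(num_levels):
        # The lowest frame rate uses the largest stride between frames.
        stride = factor ** (num_levels - 1 - level)
        indices = list(range(0, num_frames, stride))
        known_set = set(known)
        new = [i for i in indices if i not in known_set]
        schedule.append({"stride": stride, "condition": list(known), "generate": new})
        known = indices
    return schedule


if __name__ == "__main__":
    for step in frame_rate_schedule(num_frames=9, num_levels=3):
        print(step)
    # {'stride': 4, 'condition': [], 'generate': [0, 4, 8]}
    # {'stride': 2, 'condition': [0, 4, 8], 'generate': [2, 6]}
    # {'stride': 1, 'condition': [0, 2, 4, 6, 8], 'generate': [1, 3, 5, 7]}
```

Within a level, all frames in `generate` would be denoised jointly with bidirectional attention; the autoregression runs only over the outer loop across levels.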

Paper Structure

This paper contains 29 sections, 7 equations, 8 figures, and 5 tables.

Figures (8)

  • Figure 1: TempoMaster first generates a video sequence at a coarse, low frame rate to establish the global dynamics and semantic structure, and subsequently refines it by predicting frames at higher rates, thereby enhancing temporal smoothness and detail. This next-frame-rate prediction paradigm results in videos with improved motion quality and temporal consistency.
  • Figure 2: Different video modeling paradigms. Autoregressive models generate frames sequentially under a causal structure. Bidirectional models generate the entire sequence at once by processing the full sequence directly. TempoMaster establishes the global structure via a low-frame-rate bidirectional pass, then progressively enhances local details by predicting the video at the next higher frame rate.
  • Figure 3: Multi-Frame-Rate Training. TempoMaster is trained on videos with varying frame rates, which are signaled to the model by scaling the interval of the temporal positional indices. As illustrated, training on a video at half the highest frame rate employs a positional index interval of 2.
  • Figure 4: Multi-Mask Condition. Condition frames are zero-padded to the length of the full sequence; their latent representations and a frame-wise mask that provides precise timestep information are then concatenated with the noisy latents to guide generation.
  • Figure 5: The inference process of TempoMaster. TempoMaster first generates videos with the lowest frame rate and the largest interval of temporal position indices. Within the same level, the generated frames can be partitioned into multiple segments to enable parallel generation, which proceeds hierarchically down to the leaf node level.
  • ...and 3 more figures
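
The captions for Figures 3 and 4 can be made concrete with a small sketch. It is an illustration under stated assumptions, not the paper's code: the helper names `temporal_positions` and `build_multi_mask_input` are hypothetical, latents are reduced to one flat list of channel values per frame, zero-padding is applied directly to the latents, and the concatenation is assumed to happen along the channel dimension.

```python
def temporal_positions(num_frames, interval):
    """Temporal positional indices for a clip sampled at a reduced frame rate.

    Per the Figure 3 caption, a clip at half the highest frame rate uses an
    interval of 2, i.e. indices 0, 2, 4, ...
    """
    return [i * interval for i in range(num_frames)]


def build_multi_mask_input(noisy_latents, condition_frames):
    """Hypothetical sketch of the conditioning described for Figure 4.

    noisy_latents:    list of T frames, each a list of C floats.
    condition_frames: dict {frame_index: list of C floats} for the frames
                      that are given as conditions.

    Returns T frames of 2*C + 1 values each:
    [noisy latent | zero-padded condition latent | frame-wise mask].
    Channel-wise concatenation is an assumption, not confirmed by the source.
    """
    channels = len(noisy_latents[0])
    model_input = []
    for t, noisy in enumerate(noisy_latents):
        cond = condition_frames.get(t)
        # Mask is 1.0 where a condition frame exists, 0.0 where it was padded.
        mask = 1.0 if cond is not None else 0.0
        padded = list(cond) if cond is not None else [0.0] * channels
        model_input.append(list(noisy) + padded + [mask])
    return model_input


if __name__ == "__main__":
    print(temporal_positions(num_frames=4, interval=2))  # [0, 2, 4, 6]
    noisy = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
    x = build_multi_mask_input(noisy, {0: [1.0, 2.0]})
    print(x[0])  # [0.5, 0.5, 1.0, 2.0, 1.0]
    print(x[1])  # [0.5, 0.5, 0.0, 0.0, 0.0]
```

Because the mask marks exactly which positions hold real condition frames, the same input layout can in principle represent different condition patterns, such as a first frame only or sparse frames from a lower-rate pass.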