TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction
Yukuo Ma, Cong Liu, Junke Wang, Junqi Liu, Haibin Huang, Zuxuan Wu, Chi Zhang, Xuelong Li
TL;DR
TempoMaster addresses the challenge of generating long, temporally coherent videos efficiently. It introduces next-frame-rate prediction: the model first creates a low-frame-rate global blueprint of the whole video, then progressively refines it at higher frame rates with a Multi-Mask Diffusion Transformer, which allows temporal segments to be refined in parallel. Key contributions include a two-stage training regime on multi-frame-rate data, a unified conditioning framework for diverse inputs, and a parallel inference strategy with theoretical speedups. Results on VBench and in human studies show state-of-the-art performance for long-video generation, and ablations support the design choices. The approach targets efficient, coherent long-video generation for applications such as film, storytelling, and interactive media.
Abstract
We present TempoMaster, a novel framework that formulates long video generation as next-frame-rate prediction. Specifically, we first generate a low-frame-rate clip that serves as a coarse blueprint of the entire video sequence, and then progressively increase the frame rate to refine visual details and motion continuity. During generation, TempoMaster employs bidirectional attention within each frame-rate level while performing autoregression across frame rates, thus achieving long-range temporal coherence while enabling efficient and parallel synthesis. Extensive experiments demonstrate that TempoMaster establishes a new state-of-the-art in long video generation, excelling in both visual and temporal quality.
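
The coarse-to-fine ordering described above can be illustrated with a minimal sketch of a frame-rate schedule. It only shows which frame indices would be newly generated at each level. The function name `frame_rate_schedule`, the `factor` parameter, and the choice to double the frame rate at every level are assumptions made for illustration and are not taken from the paper. The generative model itself is not represented here.

```python
from typing import List


def frame_rate_schedule(num_frames: int, num_levels: int, factor: int = 2) -> List[List[int]]:
    """Return, for each frame-rate level, the frame indices first generated at that level.

    Level 0 is the coarsest (lowest frame rate). Each later level multiplies the
    frame rate by `factor` (an assumed value, not specified in the source) and
    only fills in frames that earlier levels have not produced yet.
    """
    if num_frames < 1 or num_levels < 1 or factor < 2:
        raise ValueError("num_frames and num_levels must be >= 1, factor must be >= 2")

    known = set()   # frames already produced by coarser levels
    schedule = []
    for level in range(num_levels):
        # The stride shrinks as the frame rate grows; the last level has stride 1.
        stride = factor ** (num_levels - 1 - level)
        current = set(range(0, num_frames, stride))
        schedule.append(sorted(current - known))
        known |= current
    return schedule


if __name__ == "__main__":
    for level, frames in enumerate(frame_rate_schedule(num_frames=9, num_levels=3)):
        print(f"level {level}: new frames {frames}")
    # level 0: new frames [0, 4, 8]
    # level 1: new frames [2, 6]
    # level 2: new frames [1, 3, 5, 7]
```

In this sketch, all frames listed for one level would be produced together, which mirrors the bidirectional attention within a level, while each level depends on the frames from coarser levels, which mirrors the autoregression across frame rates.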
