Macro-from-Micro Planning for High-Quality and Parallelized Autoregressive Long Video Generation
Xunzhi Xiang, Yabo Chen, Guiyu Zhang, Zhongyu Wang, Zhe Gao, Quanming Xiang, Gonghu Shang, Junqi Liu, Haibin Huang, Yang Gao, Chi Zhang, Qi Fan, Xuelong Li
TL;DR
The paper tackles autoregressive long-video generation by introducing Macro-from-Micro Planning (MMPL), a hierarchical planning framework that first predicts micro-level keyframes within each segment and then chains these micro plans into a global macro plan to maintain long-horizon coherence. Content Populating then generates the intermediate frames of all segments in parallel, guided by the planned frames, while an adaptive workload scheduler mitigates GPU bottlenecks. Empirical results show that MMPL achieves better quality and stability than strong baselines, delivers substantial speedups in multi-GPU settings, and is preferred in human evaluations. The approach also includes a drift-resilient re-encoding strategy to stabilize inter-segment transitions, and it is compatible with existing acceleration and self-forcing methods. One limitation is the reliance on a static text prompt for hour-long videos; future work aims to support dynamic prompts and real-time streaming via distillation.
Abstract
Current autoregressive diffusion models excel at video generation but are generally limited to short temporal durations. Our theoretical analysis indicates that autoregressive modeling typically suffers from temporal drift caused by error accumulation and hinders parallelization in long video synthesis. To address these limitations, we propose a novel planning-then-populating framework centered on Macro-from-Micro Planning (MMPL) for long video generation. MMPL sketches a global storyline for the entire video through two hierarchical stages: Micro Planning and Macro Planning. Specifically, Micro Planning predicts a sparse set of future keyframes within each short video segment, offering motion and appearance priors to guide high-quality video segment generation. Macro Planning extends this in-segment keyframe planning across the entire video through an autoregressive chain of micro plans, ensuring long-term consistency across video segments. Subsequently, MMPL-based Content Populating generates all intermediate frames in parallel across segments, enabling efficient parallelization of autoregressive generation. The parallelization is further optimized by Adaptive Workload Scheduling for balanced GPU execution and accelerated autoregressive video generation. Extensive experiments confirm that our method outperforms existing long video generation models in quality and stability. Generated videos and comparison results are available on our project page.
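The planning-then-populating control flow described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: a "frame" is a single float, `micro_plan` advances by a fixed step instead of sampling keyframes from a diffusion model, `populate` uses linear interpolation instead of a generative model, and every function name and parameter is hypothetical. Adaptive Workload Scheduling is not modeled. The sketch only shows where the sequential dependency sits (the chain of micro plans) and why the populating step can run in parallel across segments.

```python
from concurrent.futures import ThreadPoolExecutor


def micro_plan(anchor, num_keyframes=3, step=1.0):
    """Toy Micro Planning: predict a sparse set of keyframes for one segment.

    Assumption: each keyframe advances the anchor by a fixed step. A real
    system would predict keyframes with a video model.
    """
    return [anchor + step * (i + 1) for i in range(num_keyframes)]


def macro_plan(first_frame, num_segments):
    """Toy Macro Planning: chain micro plans across the whole video.

    Each micro plan is seeded by the last keyframe of the previous one, so
    only this cheap, sparse stage is sequential.
    """
    plans, anchor = [], first_frame
    for _ in range(num_segments):
        keyframes = micro_plan(anchor)
        plans.append((anchor, keyframes))
        anchor = keyframes[-1]
    return plans


def populate(plan, frames_between=2):
    """Toy Content Populating: fill the frames between planned frames.

    Assumption: linear interpolation stands in for the generative model.
    Each returned run ends on a planned keyframe.
    """
    anchor, keyframes = plan
    waypoints = [anchor] + keyframes
    frames = []
    for a, b in zip(waypoints, waypoints[1:]):
        for j in range(1, frames_between + 2):
            frames.append(a + (b - a) * j / (frames_between + 1))
    return frames


def generate(first_frame, num_segments, workers=4):
    """Plan sequentially, then populate all segments concurrently."""
    plans = macro_plan(first_frame, num_segments)
    # Segments depend only on their own plan, so they can be filled in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        segments = list(pool.map(populate, plans))
    video = [first_frame]
    for segment in segments:
        video.extend(segment)
    return video


if __name__ == "__main__":
    print(generate(0.0, num_segments=2))
```

Because each segment is populated from its own planned frames rather than from the previously generated segment, the result does not depend on the number of workers; in this toy version, `generate(0.0, 2, workers=1)` and `generate(0.0, 2, workers=4)` return the same list.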
