Macro-from-Micro Planning for High-Quality and Parallelized Autoregressive Long Video Generation

Xunzhi Xiang, Yabo Chen, Guiyu Zhang, Zhongyu Wang, Zhe Gao, Quanming Xiang, Gonghu Shang, Junqi Liu, Haibin Huang, Yang Gao, Chi Zhang, Qi Fan, Xuelong Li

TL;DR

The paper tackles autoregressive long-video generation by introducing Macro-from-Micro Planning (MMPL), a hierarchical planning framework that first establishes micro-level keyframes within each segment and then chains these into a global macro plan to maintain long-horizon coherence. Content Populating then generates the intermediate frames in parallel across segments, guided by the planning frames, while an adaptive workload scheduler mitigates GPU bottlenecks. Empirical results show that MMPL achieves higher quality and stability than strong baselines, along with substantial speedups in multi-GPU settings and a strong preference in human evaluation. The approach also includes a drift-resilient re-encoding strategy to stabilize inter-segment transitions and is compatible with existing acceleration and self-forcing methods. Limitations include reliance on a static text prompt for hour-long videos; future work aims to enable dynamic prompts and real-time streaming via distillation.

Abstract

Current autoregressive diffusion models excel at video generation but are generally limited to short temporal durations. Our theoretical analysis indicates that autoregressive modeling typically suffers from temporal drift caused by error accumulation and that it hinders parallelization in long video synthesis. To address these limitations, we propose a novel planning-then-populating framework centered on Macro-from-Micro Planning (MMPL) for long video generation. MMPL sketches a global storyline for the entire video through two hierarchical stages: Micro Planning and Macro Planning. Specifically, Micro Planning predicts a sparse set of future keyframes within each short video segment, offering motion and appearance priors to guide high-quality video segment generation. Macro Planning extends the in-segment keyframe planning across the entire video through an autoregressive chain of micro plans, ensuring long-term consistency across video segments. Subsequently, MMPL-based Content Populating generates all intermediate frames in parallel across segments, enabling efficient parallelization of autoregressive generation. The parallelization is further optimized by Adaptive Workload Scheduling for balanced GPU execution and accelerated autoregressive video generation. Extensive experiments confirm that our method outperforms existing long video generation models in quality and stability. Generated videos and comparison results are available on our project page.
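
The control flow of the planning-then-populating idea can be illustrated with a toy sketch that works on frame indices only. It is not the authors' implementation: the function names (micro_plan, macro_plan, populate, generate), the keyframe offsets (0, 4, 8), and the use of a thread pool in place of multi-GPU execution are all assumptions made for illustration, and no diffusion model is involved. The sketch shows only that the chain of micro plans is built sequentially, while the intermediate frames of each segment can be filled independently once the keyframes are fixed.

```python
from concurrent.futures import ThreadPoolExecutor


def micro_plan(anchor, offsets):
    """Toy stand-in for Micro Planning: choose sparse keyframe indices
    for one segment, relative to the segment's first frame (anchor)."""
    return [anchor + o for o in offsets]


def macro_plan(num_segments, offsets):
    """Toy stand-in for Macro Planning: chain micro plans sequentially,
    each one starting from the last keyframe of the previous plan."""
    plans, anchor = [], 0
    for _ in range(num_segments):
        plan = micro_plan(anchor, offsets)
        plans.append(plan)
        anchor = plan[-1]  # the next segment is conditioned on this keyframe
    return plans


def populate(plan):
    """Toy stand-in for Content Populating: list the intermediate frames
    between a segment's keyframes. Needs only that segment's own plan."""
    keyframes = set(plan)
    return [t for t in range(plan[0], plan[-1] + 1) if t not in keyframes]


def generate(num_segments, offsets=(0, 4, 8)):
    """Plan sequentially, then populate all segments concurrently."""
    plans = macro_plan(num_segments, offsets)
    with ThreadPoolExecutor() as pool:  # hypothetical stand-in for multiple GPUs
        filled = list(pool.map(populate, plans))
    frames = {k for plan in plans for k in plan}
    frames |= {t for segment in filled for t in segment}
    return plans, sorted(frames)


if __name__ == "__main__":
    plans, frames = generate(3)
    print(plans)        # keyframes per segment
    print(len(frames))  # total number of frames covered
```

Only macro_plan carries a sequential dependency here, and it touches a small number of keyframes per segment. The populate calls share no state, which is the property that makes parallel execution across segments possible in this sketch.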

Paper Structure

This paper contains 21 sections, 16 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: We propose Macro-from-Micro Planning (MMPL), a paradigm for long-video generation that achieves higher visual quality, faster speed, and stronger user preference than existing methods. Snapshots at 0s, 10s, 20s, and 30s (left) show robustness against temporal drift—semantic shifts, color changes, and structural artifacts—while quantitative results highlight accelerated multi-GPU inference (top-right) and dominant user preference (bottom-right).
  • Figure 2: Existing AR methods generate frames sequentially in a step-by-step manner, inevitably causing error accumulation (as shown in Figure 1) and prohibiting parallel generation.
  • Figure 3: Overall framework of Macro-from-Micro Planning. Our method operates on two planning levels: (1) Micro Planning, which predicts a sequence of future frames within each segment to mitigate local error accumulation, and (2) Macro Planning, formed as an Autoregressive Chain of Micro Plans, where the planning frames of the first segment autoregressively generate the planning frames of subsequent segments, ensuring long-horizon temporal consistency.
  • Figure 4: Our Re-Encoding and Decoding Strategy.
  • Figure 5: Two Stages of our MMPL-based Content Populating.
  • ...and 11 more figures