Pack and Force Your Memory: Long-form and Consistent Video Generation

Xiaofei Wu, Guozhen Zhang, Zhiyong Xu, Yuan Zhou, Qinglin Lu, Xuming He

TL;DR

This work tackles two core bottlenecks of long-form video generation: maintaining long-range temporal coherence and mitigating error accumulation in autoregressive models. It introduces MemoryPack, a memory-augmented framework that combines FramePack for short-term context with SemanticPack for long-term context, guided by the text prompt and a reference image, and models dependencies with linear complexity. It also introduces Direct Forcing, a single-step rectified-flow strategy that aligns training with inference without distillation. Together, they achieve state-of-the-art results on VBench across motion, background, and subject consistency, while reducing drift and improving robustness for minute-scale videos. The approach makes autoregressive video models more practical by enabling stable, scalable, and coherent long-form generation with efficient training and inference. The evaluation covers both quantitative metrics and human judgments.

Abstract

Long-form video generation presents a dual challenge: models must capture long-range dependencies while preventing the error accumulation inherent in autoregressive decoding. To address these challenges, we make two contributions. First, for dynamic context modeling, we propose MemoryPack, a learnable context-retrieval mechanism that leverages both textual and image information as global guidance to jointly model short- and long-term dependencies, achieving minute-level temporal consistency. This design scales gracefully with video length, preserves computational efficiency, and maintains linear complexity. Second, to mitigate error accumulation, we introduce Direct Forcing, an efficient single-step approximation strategy that improves training-inference alignment and thereby curtails error propagation during inference. Together, MemoryPack and Direct Forcing substantially enhance the context consistency and reliability of long-form video generation, advancing the practical usability of autoregressive video models.
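
The linear-complexity claim can be illustrated with a rough cost count. The sketch below is not the authors' implementation. It assumes the history is split into equal windows, that each window uses local self-attention plus cross-attention to a fixed-size memory, and it counts query-key pairs only. The function names, `window_tokens`, `memory_tokens`, and the sizes in the demo are all hypothetical.

```python
def full_attention_pairs(num_windows: int, window_tokens: int) -> int:
    """Query-key pairs when every token attends to the entire history."""
    total_tokens = num_windows * window_tokens
    return total_tokens * total_tokens


def windowed_memory_pairs(num_windows: int, window_tokens: int,
                          memory_tokens: int) -> int:
    """Query-key pairs under the assumed scheme: self-attention inside each
    window, then cross-attention from that window to a fixed-size memory."""
    per_window = window_tokens * window_tokens + window_tokens * memory_tokens
    return num_windows * per_window


if __name__ == "__main__":
    # Hypothetical sizes: 256 tokens per window, 64 memory tokens.
    for n in (10, 20, 40):
        print(n, full_attention_pairs(n, 256), windowed_memory_pairs(n, 256, 64))
```

Under these assumptions, doubling the number of windows doubles the windowed cost but quadruples the full-attention cost, which is the sense in which a fixed-size memory keeps the cost linear in video length.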

Paper Structure

This paper contains 30 sections, 8 equations, 9 figures, and 3 tables.

Figures (9)

  • Figure 1: Overview of our framework. Given a text prompt, an input image, and history frames, the model autoregressively generates future frames. Prior to feeding data into MM-DiT, MemoryPack retrieves both long- and short-term context. In SemanticPack, visual features are extracted within local windows via self-attention, followed by cross-attention to align them with global textual and visual information to iteratively generate long-term dependencies $\psi_{n}$. This design achieves linear computational complexity and substantially improves the efficiency of long-form video generation.
  • Figure 2: Schematic illustration of the approximation process. In Student Forcing, multi-step inference is applied to approximate $\mathbf{\hat{x}}_1$, but this incurs substantial computational overhead and slows training convergence. In contrast, Direct Forcing applies a single-step transformation from $\mathbf{x}_1$ to $\mathbf{x}_t$, followed by a denoising step that produces $\tilde{\mathbf{x}}_1$ as an estimate of $\mathbf{\hat{x}}_1$. This approach incurs no additional computational burden, thereby enabling faster training.
  • Figure 3: Visualization of 30-second videos comparing all methods in terms of consistency preservation and interaction capability. Prompt: Close-up view of vegetables being added into a large silver pot of simmering broth, with leafy greens and stems swirling vividly in the bubbling liquid. Rising steam conveys warmth and motion, while blurred kitchen elements and natural light in the background create a homely yet dynamic culinary atmosphere.
  • Figure 4: Visualization of a 60-second video illustrating the accumulation of errors. Our method maintains image quality comparable to the first frame even over minute-long sequences. Prompt: The sun sets over a serene lake nestled within majestic mountains, casting a warm, golden glow that softens at the horizon. The sky is a vibrant canvas of orange, pink, and purple, with wispy clouds catching the last light. Calm and reflective, the lake's surface mirrors the breathtaking colors of the sky in a symphony of light and shadow. In the foreground, lush greenery and rugged rocks frame the tranquil scene, adding a sense of life and stillness. Majestic, misty mountains rise in the background, creating an overall atmosphere of profound peace and tranquility.
  • Figure 5: Consistency evaluation on a 60-second video shows that when an object ID is heavily occluded for an extended period, reconstruction remains challenging. Both F0 and F1 fail to follow the prompt and exhibit noticeable error accumulation. Although MAGI-1 follows the prompt, it is unable to maintain temporal consistency. Prompt: On the peaceful, sun-drenched sandy beach, a small crab first retreats into its burrow before reemerging. The lens captures its shimmering shell and discreet stride under the low sun angle. As it slowly crawls outward, the crab leaves a faint trail behind, while its elongated shadow adds a cinematic texture to this tranquil scene.
  • ...and 4 more figures
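
The single-step approximation described in the Figure 2 caption can be sketched on toy vectors. This is a minimal illustration, not the paper's training code. It assumes a common rectified-flow convention in which $\mathbf{x}_0$ is noise, $\mathbf{x}_1$ is data, $\mathbf{x}_t = (1-t)\,\mathbf{x}_0 + t\,\mathbf{x}_1$, and the model predicts the velocity $\mathbf{x}_1 - \mathbf{x}_0$. All function names, including the `oracle_velocity` stand-in for a trained network, are hypothetical.

```python
import random
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    """Single-step transformation to x_t on the straight path (assumed convention)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def one_step_estimate(x_t: Vector, t: float, velocity_fn: VelocityFn) -> Vector:
    """One denoising step: follow the predicted velocity from time t to time 1."""
    v = velocity_fn(x_t, t)
    return [xt + (1.0 - t) * vi for xt, vi in zip(x_t, v)]


def direct_forcing_estimate(x1: Vector, x0: Vector, t: float,
                            velocity_fn: VelocityFn) -> Vector:
    """Build x_t from clean data x1 and noise x0, then denoise once to get
    an estimate of what the model itself would generate."""
    x_t = interpolate(x0, x1, t)
    return one_step_estimate(x_t, t, velocity_fn)


if __name__ == "__main__":
    rng = random.Random(0)
    x1 = [0.5, -1.0, 2.0]                  # toy "clean" latent
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # toy noise sample

    def oracle_velocity(x_t: Vector, t: float) -> Vector:
        # Exact velocity for this (x0, x1) pair; a real model only approximates it.
        return [b - a for a, b in zip(x0, x1)]

    print(direct_forcing_estimate(x1, x0, 0.3, oracle_velocity))
```

With an exact velocity the estimate recovers $\mathbf{x}_1$ up to floating-point error. With an imperfect predictor the estimate carries that predictor's error, which is what lets it stand in for the model's own output without running multi-step inference.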