VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory

Yifei Yu, Xiaoshan Wu, Xinting Hu, Tao Hu, Yangtian Sun, Xiaoyang Lyu, Bo Wang, Lin Ma, Yuewen Ma, Zhongrui Wang, Xiaojuan Qi

TL;DR

VideoSSM tackles the challenge of long-horizon video generation by treating synthesis as a recurrent process that requires both short-term precision and long-term coherence. It introduces a hybrid memory architecture that couples a local sliding-window memory with a dynamic global state-space memory, enabling linear-time, causal diffusion-based video generation with minimal drift. Through Self-Forcing distillation and rolling memory during training, VideoSSM achieves state-of-the-art temporal consistency on minute-scale sequences and supports interactive prompt-based control. The approach demonstrates strong performance on short and long benchmarks, with qualitative and user-study evidence of improved motion realism and coherence, suggesting a scalable path toward robust long-form video synthesis.

Abstract

Autoregressive (AR) diffusion enables streaming, interactive long-video generation by producing frames causally, yet maintaining coherence over minute-scale horizons remains challenging due to accumulated errors, motion drift, and content repetition. We approach this problem from a memory perspective, treating video synthesis as a recurrent dynamical process that requires coordinated short- and long-term context. We propose VideoSSM, a Long Video Model that unifies AR diffusion with a hybrid state-space memory. The state-space model (SSM) serves as an evolving global memory of scene dynamics across the entire sequence, while a context window provides local memory for motion cues and fine details. This hybrid design preserves global consistency without frozen, repetitive patterns, supports prompt-adaptive interaction, and scales in linear time with sequence length. Experiments on short- and long-range benchmarks demonstrate state-of-the-art temporal consistency and motion stability among autoregressive video generators, especially at minute-scale horizons, enabling content diversity and interactive prompt-based control, thereby establishing a scalable, memory-aware framework for long video generation.
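
To make the memory perspective concrete, the following is a minimal NumPy sketch of a hybrid memory step, not the paper's implementation. It assumes one token per step, a decayed outer-product update as a stand-in for the SSM recurrence, and a fixed scalar `gate` in place of the learned router. The class name `HybridMemory` and the parameters `decay` and `gate` are hypothetical.

```python
import numpy as np
from collections import deque


def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


class HybridMemory:
    """Toy hybrid memory: lossless local window plus compressed global state.

    Assumptions (not from the paper): the global state is a (dim, dim) matrix
    updated with a decayed outer product, and fusion uses a fixed gate.
    """

    def __init__(self, dim, window, decay=0.9):
        self.window = window
        self.decay = decay
        self.cache = deque()               # local memory: recent (key, value) pairs
        self.state = np.zeros((dim, dim))  # global memory: fixed-size summary

    def step(self, q, k, v, gate=0.5):
        # Append the newest token to the local window.
        self.cache.append((k, v))
        # When the window overflows, fold the evicted token into the global
        # state instead of discarding it.
        if len(self.cache) > self.window:
            k_old, v_old = self.cache.popleft()
            self.state = self.decay * self.state + np.outer(k_old, v_old)
        # Local path: softmax attention over the window only.
        keys = np.stack([kv[0] for kv in self.cache])
        vals = np.stack([kv[1] for kv in self.cache])
        weights = softmax(keys @ q / np.sqrt(q.shape[0]))
        local_out = weights @ vals
        # Global path: read the compressed state with the query.
        global_out = q @ self.state
        # Fixed-gate fusion (stand-in for a learned router).
        return (1.0 - gate) * local_out + gate * global_out


rng = np.random.default_rng(0)
mem = HybridMemory(dim=4, window=3)
outputs = [mem.step(*rng.normal(size=(3, 4))) for _ in range(10)]
```

Each step touches at most `window` cached tokens plus a fixed-size state, so the per-step cost does not grow with the number of tokens already generated, which is what makes the overall cost linear in sequence length.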

Paper Structure

This paper contains 22 sections, 9 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: We introduce VideoSSM, an AR video diffusion model equipped with a novel hybrid memory architecture that combines a causal sliding-window local lossless cache with an SSM-based global compressed memory. Prior AR Diffusion Transformers (AR DiTs) either use only window attention, which leads to quality degradation and temporal drifting, or add sink frames, which reduce drift but cause content repetition and a lack of dynamism. In contrast, our hybrid memory yields videos that remain both long-term consistent and progressively dynamic. Trained via Self Forcing distillation [huang2025self] with a DMD loss [yin2024onestep] from a bidirectional teacher, VideoSSM supports highly stable long video generation and adaptive, interactive prompt-based video generation.
  • Figure 2: Comparison of DiT block architectures for autoregressive video generation. (a) Standard DiT block with full self-attention, which supports long-context modeling but lacks causality and streaming capability. (b) AR DiT block with masked causal attention, enabling autoregressive and streaming generation at the cost of weakened long-context consistency. (c) Our AR DiT block with a hybrid memory module and router, which combines local causal attention with a learnable global memory to achieve causal generation, streaming, and long-context support.
  • Figure 3: Illustration of attention mechanisms in AR DiT. Let $T$ be the video token length and $L$ the sliding-window size. (a) Causal Attention: Each query attends to all past tokens. It captures the full context with quadratic $O(T^2)$ complexity, impractical for long sequences. (b) Window Attention: Localized attention within a local sliding window. It enables efficient $O(TL)$ complexity for streaming but causes information drift as early tokens are evicted. (c) Attention Sink: Adds fixed initial "sink" tokens to the window. It improves long-range consistency with $O(TL)$ complexity, but the static memory leads to repetitive generation and fails to adapt to new content. (d) Ours (Hybrid Memory): Augments window attention with a learnable memory that compresses evicted tokens. This maintains $O(TL)$ efficiency while providing a dynamic global context, balancing long-term consistency and adaptability.
  • Figure 4: Illustration of how sink, evicted, and window tokens are arranged at different timesteps in a causal DiT with sliding-window attention. Here window length $L=3$.
  • Figure 5: Architecture of the proposed hybrid memory module. The input $H^{\text{in}}_t$ is processed in two streams. The local path (top) uses windowed attention with a sliding KV cache to compute $H^{\text{local}}_t$. The global path (bottom) uses a State-Space Model (SSM) to recurrently compress historical information into a memory state $M$, which is retrieved to produce $H^{\text{global}}_t$. A router then dynamically fuses the local and global outputs.
  • ...and 2 more figures
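
The attention patterns contrasted in the Figure 3 caption can be illustrated with boolean masks. This is a rough sketch rather than code from the paper: the helper names and the `n_sink` parameter are hypothetical, and positions are treated as individual tokens. Entry `[i, j]` is `True` when query `i` may attend to key `j`.

```python
import numpy as np


def _grid(T):
    """Query indices as a column, key indices as a row."""
    return np.arange(T)[:, None], np.arange(T)[None, :]


def causal_mask(T):
    """(a) Each query sees every past token: O(T^2) allowed entries."""
    i, j = _grid(T)
    return j <= i


def window_mask(T, L):
    """(b) Each query sees only the most recent L tokens: O(TL) entries."""
    i, j = _grid(T)
    return (j <= i) & (j > i - L)


def sink_window_mask(T, L, n_sink=1):
    """(c) Window attention plus the first n_sink tokens kept visible."""
    i, j = _grid(T)
    return window_mask(T, L) | ((j < n_sink) & (j <= i))


def evicted_mask(T, L):
    """Tokens that have left the window; in (d) these are compressed into
    the global memory rather than attended to directly."""
    i, j = _grid(T)
    return j <= i - L


T, L = 6, 3  # L = 3 matches the window length used in Figure 4
print(window_mask(T, L).astype(int))
```

Under these definitions the window and evicted masks partition the causal mask, which is one way to read panel (d): the hybrid memory keeps the local window exact and summarizes everything the window has dropped.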