VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory
Yifei Yu, Xiaoshan Wu, Xinting Hu, Tao Hu, Yangtian Sun, Xiaoyang Lyu, Bo Wang, Lin Ma, Yuewen Ma, Zhongrui Wang, Xiaojuan Qi
TL;DR
VideoSSM tackles the challenge of long-horizon video generation by treating synthesis as a recurrent process that requires both short-term precision and long-term coherence. It introduces a hybrid memory architecture that couples a local sliding-window memory with a dynamic global state-space memory, enabling linear-time, causal diffusion-based video generation with minimal drift. Through Self-Forcing distillation and rolling memory during training, VideoSSM achieves state-of-the-art temporal consistency on minute-scale sequences and supports interactive prompt-based control. The approach demonstrates strong performance on short and long benchmarks, with qualitative and user-study evidence of improved motion realism and coherence, suggesting a scalable path toward robust long-form video synthesis.
Abstract
Autoregressive (AR) diffusion enables streaming, interactive long-video generation by producing frames causally, yet maintaining coherence over minute-scale horizons remains challenging due to accumulated errors, motion drift, and content repetition. We approach this problem from a memory perspective, treating video synthesis as a recurrent dynamical process that requires coordinated short- and long-term context. We propose VideoSSM, a Long Video Model that unifies AR diffusion with a hybrid state-space memory. The state-space model (SSM) serves as an evolving global memory of scene dynamics across the entire sequence, while a context window provides local memory for motion cues and fine details. This hybrid design preserves global consistency without frozen, repetitive patterns, supports prompt-adaptive interaction, and scales in linear time with sequence length. Experiments on short- and long-range benchmarks demonstrate state-of-the-art temporal consistency and motion stability among autoregressive video generators, especially at minute-scale horizons. The model also supports content diversity and interactive prompt-based control, establishing a scalable, memory-aware framework for long video generation.
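To make the hybrid-memory idea concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes that frames leaving the local window are folded into a fixed-size recurrent state, and it uses a scalar exponential decay as a stand-in for a learned SSM update. The names `HybridMemory`, `rollout`, `window`, and `decay` are hypothetical and chosen for illustration only. The point it shows is that the context seen at each step stays bounded (one global state plus at most `window` recent frames), so total cost grows linearly with the number of frames.

```python
from collections import deque
from typing import Deque, List, Tuple


class HybridMemory:
    """Toy hybrid memory: exact recent window plus a compressed recurrent state."""

    def __init__(self, dim: int, window: int, decay: float = 0.9) -> None:
        self.local: Deque[List[float]] = deque()  # local sliding-window memory
        self.window = window
        self.decay = decay  # assumption: fixed scalar decay, not a learned SSM
        self.state = [0.0] * dim  # fixed-size global memory

    def update(self, feat: List[float]) -> None:
        """Append a new frame feature; fold the evicted one into the state."""
        self.local.append(feat)
        if len(self.local) > self.window:
            evicted = self.local.popleft()
            # Linear recurrence: h_t = decay * h_{t-1} + (1 - decay) * x_t
            self.state = [
                self.decay * h + (1.0 - self.decay) * x
                for h, x in zip(self.state, evicted)
            ]

    def context(self) -> List[List[float]]:
        """Context for the next step: global state followed by recent frames."""
        return [list(self.state)] + [list(f) for f in self.local]


def rollout(num_frames: int, dim: int = 4, window: int = 3) -> Tuple[HybridMemory, List[int]]:
    """Run a causal loop and record the context size seen at each step."""
    mem = HybridMemory(dim, window)
    sizes = []
    for t in range(num_frames):
        feat = [float(t)] * dim  # stand-in for a generated frame's features
        mem.update(feat)
        sizes.append(len(mem.context()))
    return mem, sizes
```

Calling `rollout(10)` keeps the context at no more than four entries per step regardless of how many frames have been produced, which is the property that lets a design of this kind scale linearly rather than quadratically with sequence length.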
