Pushing the Boundaries of State Space Models for Image and Video Generation
Yicong Hong, Long Mai, Yuan Yao, Feng Liu
TL;DR
The paper explores how far state-space models (SSMs) can be pushed for visual generation by introducing the Hydra-Transformer Hybrid (HTH), a 5B-parameter diffusion model that combines bidirectional Hydra SSMs with self-attention. By interleaving Hydra and Transformer blocks and adapting scanning patterns for video, HTH generates images at up to 2K resolution and 360p videos with strong prompt fidelity and temporal coherence. Empirical results show image and video quality competitive with diffusion-based baselines and efficiency benefits for long sequences, while also identifying limitations in global dependency modeling and conditioning. The work highlights the practical potential of efficient, high-capacity, SSM-based visual generation and points to future research directions, including hardware optimizations and improved conditioning strategies.
Abstract
While Transformers have become the dominant architecture for visual generation, linear-attention models such as state-space models (SSMs) are increasingly recognized for their efficiency in processing long visual sequences. However, the efficiency of these models comes from formulating a limited recurrent state, which enforces causality among tokens and is prone to inconsistent modeling of N-dimensional visual data, leaving open questions about their capacity to generate long non-causal sequences. In this paper, we explore the limits of SSMs for image and video generation by building the largest-scale diffusion SSM-Transformer hybrid model to date (5B parameters), based on the sub-quadratic bidirectional Hydra and self-attention, and generate images at up to 2K resolution and 8-second 360p videos at 16 FPS. Our results demonstrate that the model can produce faithful results aligned with complex text prompts and temporally consistent videos with high dynamics, suggesting the great potential of SSMs for visual generation tasks.
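The causality constraint described in the abstract can be illustrated with a toy example. The sketch below is not Hydra's actual mixer and is not taken from the paper: it assumes a scalar recurrent state with hypothetical fixed coefficients `a`, `b`, `c`, and it approximates bidirectionality by summing a forward and a backward scan. It only shows that a causal scan lets a token influence later positions, while a bidirectional variant lets it influence both sides.

```python
def causal_scan(x, a=0.5, b=1.0, c=1.0):
    """Toy scalar recurrence: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t.

    The coefficients a, b, c are illustrative assumptions, not values
    from the paper. Memory is O(1) in sequence length (one state h).
    """
    h = 0.0
    y = []
    for x_t in x:
        h = a * h + b * x_t
        y.append(c * h)
    return y


def bidirectional_scan(x, a=0.5, b=1.0, c=1.0):
    """Simplified bidirectional variant: forward scan plus backward scan.

    Each scan includes the current token once, so c * b * x_t is
    subtracted to avoid counting that diagonal term twice.
    """
    fwd = causal_scan(x, a, b, c)
    bwd = causal_scan(x[::-1], a, b, c)[::-1]
    return [f + g - c * b * x_t for f, g, x_t in zip(fwd, bwd, x)]


# A single impulse in the middle of a 5-token sequence.
impulse = [0.0, 0.0, 1.0, 0.0, 0.0]
causal_out = causal_scan(impulse)
bidir_out = bidirectional_scan(impulse)
print(causal_out)  # [0.0, 0.0, 1.0, 0.5, 0.25]
print(bidir_out)   # [0.25, 0.5, 1.0, 0.5, 0.25]
```

In the causal output, positions before the impulse stay at zero, which is the kind of one-directional dependency that sits poorly with 2D or 3D visual data. The bidirectional output is symmetric around the impulse while still using only linear-time scans.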
