Pushing the Boundaries of State Space Models for Image and Video Generation

Yicong Hong, Long Mai, Yuan Yao, Feng Liu

TL;DR

The paper explores how far state-space models (SSMs) can be pushed in visual generation by introducing the Hydra-Transformer Hybrid (HTH), a 5B-parameter diffusion model that combines bidirectional Hydra SSM layers with self-attention. By interleaving Hydra and Transformer blocks and adapting scanning patterns for video, HTH generates images at up to 2K resolution and 360p videos with strong prompt fidelity and temporal coherence. Empirical results show image and video quality competitive with diffusion-based baselines and efficiency benefits for long sequences, while also identifying limitations in global dependency modeling and conditioning. The work highlights the practical potential of efficient, high-capacity, SSM-based visual generation and points to future research directions, including hardware optimizations and improved conditioning strategies.

Abstract

While Transformers have become the dominant architecture for visual generation, linear attention models, such as state-space models (SSMs), are increasingly recognized for their efficiency in processing long visual sequences. However, the efficiency of these models comes from maintaining a limited recurrent state, which enforces causality among tokens and is prone to inconsistent modeling of N-dimensional visual data, leaving open questions about their capacity to generate long non-causal sequences. In this paper, we explore the boundary of SSMs on image and video generation by building the largest-scale diffusion SSM-Transformer hybrid model to date (5B parameters), based on the sub-quadratic bidirectional Hydra and self-attention, and generate up to 2K images and 8-second 360p videos at 16 FPS. Our results demonstrate that the model can produce faithful results aligned with complex text prompts and temporally consistent videos with high dynamics, suggesting the great potential of SSMs for visual generation tasks.
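
To make the causality concern concrete, here is a minimal sketch of a toy scalar recurrence. It is not the Hydra layer from the paper; the function names, the decay `a`, and the input gain `b` are assumptions chosen for illustration. A single forward scan lets each output see only earlier tokens, while summing a forward and a backward scan lets every output see the whole sequence.

```python
import numpy as np

def causal_scan(x, a=0.9, b=1.0):
    """Toy linear recurrence h_t = a * h_{t-1} + b * x_t (hypothetical stand-in for an SSM)."""
    h = 0.0
    out = np.empty(len(x), dtype=float)
    for t, x_t in enumerate(x):
        h = a * h + b * x_t  # the state only ever absorbs past and current tokens
        out[t] = h
    return out

def bidirectional_scan(x, a=0.9, b=1.0):
    """Forward pass plus backward pass; the current-token term is counted once."""
    x = np.asarray(x, dtype=float)
    fwd = causal_scan(x, a, b)
    bwd = causal_scan(x[::-1], a, b)[::-1]
    return fwd + bwd - b * x

# A single impulse at position 3 of a 6-token sequence.
x = np.zeros(6)
x[3] = 1.0

causal = causal_scan(x)        # positions 0..2 never see the impulse
bidir = bidirectional_scan(x)  # every position receives some signal
```

In the causal output the first three positions stay at zero, which is the kind of one-sided context that is awkward for images and videos, where no token is naturally "earlier" than another.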

Paper Structure

This paper contains 29 sections, 5 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Text-to-1K/2K+ image generation results of our Hydra-Transformer Hybrid model. The resolution of each sample is displayed in the bottom-right corner. Text prompts and additional results are provided in the Appendix. Please zoom in for a clearer visualization.
  • Figure 2: Text-to-video generation results (360p, 128 frames) produced by our Hydra-Transformer Hybrid model. More results are provided in the Appendix. Please zoom in for a clearer visualization.
  • Figure 3: Illustration of our diffusion Hydra-Transformer Hybrid (HTH) model for image and video generation. The architecture consists of $N$ stacked blocks, each comprising a cross-attention layer, a token mixer, and a feed-forward network. (a) The token mixer can be implemented as either the Hydra state space model or self-attention. (b) For image data, we use horizontal and vertical bidirectional raster scans on tokens; for video data, an additional bidirectional temporal scan is applied.
  • Figure 4: Illustration of the model adaptation from stage 1 (text-to-image, T2I) to stage 2 (text-to-video, T2V). For each set of 11 blocks in our HTH model, we change the scanning pattern of certain Hydra layers from a spatial-major to a temporal-major scan when processing video data.
  • Figure 5: Denoiser inference speed comparison between HTH and Transformer (our DiT baseline).
  • ...and 8 more figures
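
The scan patterns named in the captions of Figures 3 and 4 can be pictured as permutations of flattened token indices. The sketch below is an illustration under assumptions: the function names are hypothetical, and the exact token orderings used by the paper are not specified in this summary. A bidirectional scan would process each order together with its reverse (`order[::-1]`).

```python
import numpy as np

def image_scan_orders(height, width):
    """Return (horizontal, vertical) raster orders over an H x W token grid."""
    idx = np.arange(height * width).reshape(height, width)
    horizontal = idx.reshape(-1)   # row by row
    vertical = idx.T.reshape(-1)   # column by column
    return horizontal, vertical

def video_scan_orders(frames, height, width):
    """Return (spatial_major, temporal_major) orders over a T x H x W token grid."""
    idx = np.arange(frames * height * width).reshape(frames, height, width)
    spatial_major = idx.reshape(-1)                       # finish one frame, then the next
    temporal_major = idx.transpose(1, 2, 0).reshape(-1)   # walk through time at each location
    return spatial_major, temporal_major

# Small grids so the orders are easy to read.
horizontal, vertical = image_scan_orders(2, 3)
spatial_major, temporal_major = video_scan_orders(2, 1, 2)

# The video length quoted in the abstract: 8 seconds at 16 FPS.
num_frames = 8 * 16
```

Under these assumptions, a temporal-major order places the same spatial location from consecutive frames next to each other in the sequence, which is one plausible reading of why such a scan would suit video data.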