ZigMa: A DiT-style Zigzag Mamba Diffusion Model

Vincent Tao Hu, Stefan Andreas Baumann, Ming Gui, Olga Grebenkova, Pingchuan Ma, Johannes Schusterbauer, Björn Ommer

TL;DR

This work tackles diffusion-model scalability limits in transformer-heavy backbones by leveraging a State-Space Model-based diffusion backbone, Mamba, extended to 2D images and 3D videos through a lightweight, zero-parameter Zigzag scanning scheme (ZigMa). Central to the approach is enforcing spatial continuity via per-layer zigzag token arrangements and augmenting with cross-attention to support text conditioning, all within the Stochastic Interpolant diffusion framework. The authors provide extensive ablations and demonstrate competitive or superior performance on high-resolution data (FacesHQ 1024x1024) and video benchmarks, with favorable speed and memory profiles compared to transformer-based backbones. The work suggests a practical path toward scalable, inductive-bias-aware diffusion for large-scale visual generation and opens avenues for applying ZigMa to other linear-attention architectures and 3D data tasks, while acknowledging ethical considerations for content synthesis.

Abstract

The diffusion model has long been plagued by scalability and quadratic complexity issues, especially within transformer-based structures. In this study, we aim to leverage the long sequence modeling capability of a State-Space Model called Mamba to extend its applicability to visual data generation. Firstly, we identify a critical oversight in most current Mamba-based vision methods, namely the lack of consideration for spatial continuity in the scan scheme of Mamba. Secondly, building upon this insight, we introduce a simple, plug-and-play, zero-parameter method named Zigzag Mamba, which outperforms Mamba-based baselines and demonstrates improved speed and memory utilization compared to transformer-based baselines. Lastly, we integrate Zigzag Mamba with the Stochastic Interpolant framework to investigate the scalability of the model on large-resolution visual datasets, such as FacesHQ $1024\times 1024$, UCF101, MultiModal-CelebA-HQ, and MS COCO $256\times 256$. Code will be released at https://taohu.me/zigma/

Paper Structure (25 sections, 12 equations, 17 figures, 10 tables, 1 algorithm)

Figures (17)

  • Figure 1: Motivation. Our Zigzag Mamba method improves the network's position-awareness by arranging and rearranging the scan path of Mamba in a heuristic manner.
  • Figure 2: ZigMa. Our backbone is structured in $L$ layers, mirroring the style of DiT (Peebles and Xie, 2022). We use the single-scan Mamba block as the primary reasoning module across different patches. To ensure the network is positionally aware, we have designed an arrange-rearrange scheme based on the single-scan Mamba. Different layers follow unique pairs of a rearrange operation $\Omega$ and a reverse rearrange operation $\bar{\Omega}$, optimizing the position-awareness of the method.
  • Figure 3: The 2D Image Scan. Our Mamba scan design is based on the sweep-scan scheme shown in subfigure (a). From this, we developed a zigzag-scan scheme displayed in subfigure (b) to enhance the continuity of the patches, thereby maximizing the potential of the Mamba block. Since there are several possible arrangements for these continuous scans, we have listed the eight most common zigzag-scans in subfigure (c).
  • Figure 4: The Detail of our Zigzag Mamba block. The detail of the Mamba Scan is shown in Figure 2. The condition can include a timestep and a text prompt. These are fed into an MLP, which separately modulates the Mamba scan for long sequence modeling and cross-attention for multi-modal reasoning.
  • Figure 5: The 3D Video Scan. (a) We illustrate the bidirectional Mamba with the sweep scan, where the spatial and temporal information is treated as a set of tokens with a computer-hierarchy order. (b) For the 3D zigzag-scan, we aim to maximize the potential of Mamba by employing a spatially continuous scan scheme and adopting the optimal zigzag scan solution, as depicted in Figure 3. (c) We further separate the reasoning between spatial and temporal information, resulting in a factorized combination of a 2D spatial scan ($\Omega$) plus a 1D temporal scan ($\Omega'$) scheme.
  • ...and 12 more figures
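
The captions for Figures 2 and 3 describe the scan only in words, so a small sketch may help make the idea concrete. The pure-Python code below is an illustration under stated assumptions, not the authors' implementation: it builds a sweep (row-major) order and a zigzag order over an h x w patch grid, and derives the inverse permutation that plays the role of the reverse rearrange $\bar{\Omega}$. The function names and the way the eight variants are enumerated (row-first or column-first traversal combined with optional row and column flips) are assumptions made for this example.

```python
def sweep_scan(h, w):
    """Row-major order: every row is read left to right."""
    return [r * w + c for r in range(h) for c in range(w)]


def zigzag_scan(h, w, column_first=False, flip_rows=False, flip_cols=False):
    """Zigzag order over an h x w grid of patch indices (r * w + c).

    Every second line is traversed backwards, so consecutive tokens
    are always direct spatial neighbours. The three flags are
    hypothetical knobs used here to enumerate eight variants.
    """
    outer, inner = (w, h) if column_first else (h, w)
    order = []
    for i in range(outer):
        js = range(inner) if i % 2 == 0 else range(inner - 1, -1, -1)
        for j in js:
            r, c = (j, i) if column_first else (i, j)
            if flip_rows:
                r = h - 1 - r
            if flip_cols:
                c = w - 1 - c
            order.append(r * w + c)
    return order


def inverse(order):
    """Inverse permutation, used to undo `arrange`."""
    inv = [0] * len(order)
    for pos, idx in enumerate(order):
        inv[idx] = pos
    return inv


def arrange(tokens, order):
    """Reorder a token list so that position k holds tokens[order[k]]."""
    return [tokens[i] for i in order]


def max_step(order, w):
    """Largest Manhattan distance between consecutive tokens in a scan."""
    return max(
        abs(a // w - b // w) + abs(a % w - b % w)
        for a, b in zip(order, order[1:])
    )


def all_variants(h, w):
    """The eight zigzag orders obtained from the three boolean flags."""
    flags = (False, True)
    return [
        zigzag_scan(h, w, cf, fr, fc)
        for cf in flags for fr in flags for fc in flags
    ]
```

In this sketch, `arrange(tokens, order)` stands in for $\Omega$ applied before a Mamba block, and `arrange(scanned, inverse(order))` restores the original patch layout afterwards. Neither step adds parameters, since both are fixed index permutations. `max_step` measures the continuity the captions refer to: it is 1 for every zigzag variant, while the sweep scan jumps across the full row width at each line break. One way to give each layer a different order would be to pick `all_variants(h, w)[i % 8]` for layer `i`; that cycling rule is an assumption of this example.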