MaskMamba: A Hybrid Mamba-Transformer Model for Masked Image Generation

Wenchao Chen, Liqiang Niu, Ziyao Lu, Fandong Meng, Jie Zhou

TL;DR

MaskMamba addresses the quadratic complexity of Transformer-based image generation by introducing a hybrid Mamba-Transformer architecture optimized for Masked Image Modeling. It redesigns the Bi-Mamba block into Bi-Mamba-v2, which uses standard convolutions and concatenation, and studies serial and grouped parallel hybrid schemes along with in-context conditioning that enables class-to-image and text-to-image generation within a single model. The approach outperforms both Transformer-based and prior Mamba-based models in generation quality, and its inference is 54.44% faster than a Transformer at 2048×2048 resolution. Results on ImageNet-1k and CC3M/COCO demonstrate MaskMamba as a scalable, flexible framework for high-fidelity, non-autoregressive image synthesis.

Abstract

Image generation models have encountered challenges related to scalability and quadratic complexity, primarily due to the reliance on Transformer-based backbones. In this study, we introduce MaskMamba, a novel hybrid model that combines Mamba and Transformer architectures, utilizing Masked Image Modeling for non-autoregressive image synthesis. We meticulously redesign the bidirectional Mamba architecture by implementing two key modifications: (1) replacing causal convolutions with standard convolutions to better capture global context, and (2) utilizing concatenation instead of multiplication, which significantly boosts performance while accelerating inference speed. Additionally, we explore various hybrid schemes of MaskMamba, including both serial and grouped parallel arrangements. Furthermore, we incorporate an in-context condition that allows our model to perform both class-to-image and text-to-image generation tasks. Our MaskMamba outperforms Mamba-based and Transformer-based models in generation quality. Notably, it achieves a remarkable $54.44\%$ improvement in inference speed at a resolution of $2048\times 2048$ over Transformer.
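
To illustrate modification (1), the following is a minimal sketch of how a causal convolution and a standard (centred) convolution differ in which positions a token can influence. It is not the authors' implementation: the helper names `conv1d` and `influence`, the kernel size of 3, and the zero padding are assumptions chosen for illustration only.

```python
def conv1d(x, kernel, causal):
    """1D convolution over a token sequence with zero padding.

    causal=True  : output[t] sees only x[t-k+1 .. t] (all padding on the left).
    causal=False : output[t] sees a window centred on t ("same" padding).
    """
    k = len(kernel)
    left = k - 1 if causal else (k - 1) // 2
    right = (k - 1) - left
    padded = [0.0] * left + list(x) + [0.0] * right
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(x))
    ]


def influence(position, length, kernel, causal):
    """Output indices that change when the input at `position` is perturbed."""
    base = [0.0] * length
    bumped = list(base)
    bumped[position] = 1.0
    out_base = conv1d(base, kernel, causal)
    out_bumped = conv1d(bumped, kernel, causal)
    return [t for t in range(length) if out_base[t] != out_bumped[t]]


# Hypothetical kernel of size 3; the paper's kernel size is not stated here.
kernel = [1.0, 1.0, 1.0]
causal_reach = influence(4, 9, kernel, causal=True)     # [4, 5, 6]
standard_reach = influence(4, 9, kernel, causal=False)  # [3, 4, 5]
```

With the causal variant, token 4 only affects later positions, whereas the standard variant spreads information to both neighbours. That symmetric reach is plausibly what matters for masked image modeling, where a masked token may need context from either side.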

Paper Structure (17 sections, 1 equation, 11 figures, 7 tables)

Figures (11)

  • Figure 1: Examples of class-conditional (top) and text-conditional (bottom) image generation using MaskMamba-XL.
  • Figure 2: MaskMamba Pipeline Overview.
  • Figure 3: (a) Structure of the original Mamba (Gu & Dao, 2023). (b) Bi-Mamba structure proposed in Vision Mamba (Zhu et al., 2024), which introduces a new branch specifically designed for vision tasks. (c) Our redesigned Mamba for masked image generation tasks, which uses standard convolution instead of causal convolution and replaces the final-stage multiplication with concatenation to reduce computation.
  • Figure 4: We design four hybrid configurations in two categories: grouped parallel and cascading serial. In the parallel category, the model is divided into either two or four groups. In the serial category, Bi-Mamba-v2 and Transformer layers are either interleaved layer by layer, or the first $N/2$ layers are Bi-Mamba-v2 and the remaining $N/2$ layers are Transformer.
  • Figure 5: Examples of class-conditional image generation using MaskMamba-L (left) and MaskMamba-XL (right) with cfg=3.0, iterations=25.
  • ...and 6 more figures
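
The two serial configurations described in the Figure 4 caption can be sketched as layer schedules. This is an illustrative sketch only: the function name `serial_layout`, the scheme labels, and the choice to start the interleaved variant with a Bi-Mamba-v2 layer are assumptions, not details confirmed by the caption. The grouped parallel variants are left out because the caption does not say how the groups are formed.

```python
def serial_layout(num_layers, scheme):
    """Return the block type used at each depth for a serial hybrid scheme.

    scheme="interleaved": alternate Bi-Mamba-v2 and Transformer layers
                          (starting with Bi-Mamba-v2 is an assumption).
    scheme="halves"     : first N/2 layers Bi-Mamba-v2, last N/2 Transformer.
    """
    if num_layers <= 0 or num_layers % 2 != 0:
        raise ValueError("num_layers must be a positive even number")
    if scheme == "interleaved":
        return [
            "bi_mamba_v2" if depth % 2 == 0 else "transformer"
            for depth in range(num_layers)
        ]
    if scheme == "halves":
        half = num_layers // 2
        return ["bi_mamba_v2"] * half + ["transformer"] * half
    raise ValueError(f"unknown scheme: {scheme!r}")


interleaved = serial_layout(4, "interleaved")
halves = serial_layout(4, "halves")
```

Both schedules use the same number of each block type, so any difference between them comes from layer ordering rather than from the mix of blocks.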