Dimba: Transformer-Mamba Diffusion Models

Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Youqiang Zhang, Junshi Huang

TL;DR

Dimba introduces a Transformer–Mamba hybrid diffusion model for text-to-image generation, addressing memory and throughput limitations of pure Transformer approaches. By interleaving Transformer and Mamba blocks with cross-attention and AdaLN-based time embeddings, it achieves competitive image quality and semantic alignment while reducing compute and memory footprints. The authors present a large, auto-labeled dataset with quality-tuned captions, a staged training strategy with high-resolution adaptation, and techniques like PE interpolation to accelerate convergence. Extensive experiments, including FID, T2I-CompBench, and human/AI preference studies, demonstrate Dimba's efficiency and versatility, and ablations reveal the impact of data curation and architectural choices. The work suggests a promising direction for scalable, high-quality diffusion with hybrid backbones and provides practical guidance for resource-constrained text-to-image generation.

Abstract

This paper unveils Dimba, a new text-to-image diffusion model that employs a distinctive hybrid architecture combining Transformer and Mamba elements. Specifically, Dimba's sequentially stacked blocks alternate between Transformer and Mamba layers and integrate conditional information through a cross-attention layer, thus capitalizing on the advantages of both architectural paradigms. We investigate several optimization strategies, including quality tuning and resolution adaptation, and identify critical configurations necessary for large-scale image generation. The model's flexible design supports scenarios that cater to specific resource constraints and objectives. When scaled appropriately, Dimba offers substantial throughput and a reduced memory footprint relative to conventional pure Transformer-based benchmarks. Extensive experiments indicate that Dimba achieves performance comparable to these benchmarks in terms of image quality, artistic rendering, and semantic control. We also report several intriguing properties of the architecture discovered during evaluation and release the checkpoints used in the experiments. Our findings emphasize the promise of large-scale hybrid Transformer-Mamba architectures in the foundational stage of diffusion models, suggesting a bright future for text-to-image generation.
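To make the layer layout concrete, the following is a minimal, dependency-free sketch rather than the authors' implementation. It shows one way to alternate attention and Mamba layers and how an AdaLN-style modulation could apply a time-conditioned shift and scale. The names `build_layer_schedule`, `mamba_per_attention`, `layer_norm`, and `adaln` are hypothetical, and both the attention-first ordering and the `(1 + scale)` convention are assumptions not stated in the abstract.

```python
import math


def build_layer_schedule(num_layers, mamba_per_attention=1):
    """Name the token mixer of each layer.

    Assumed pattern: one attention layer followed by `mamba_per_attention`
    Mamba layers, repeated until `num_layers` layers are listed.
    """
    period = mamba_per_attention + 1
    return ["attention" if i % period == 0 else "mamba" for i in range(num_layers)]


def layer_norm(x, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln(x, shift, scale):
    """Adaptive LayerNorm: LayerNorm(x) * (1 + scale) + shift, elementwise.

    In a real model, `shift` and `scale` would be predicted from the
    timestep embedding; here they are passed in directly (assumption).
    """
    normed = layer_norm(x)
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]


# Strict alternation, then a hypothetical 1:3 attention-to-Mamba ratio.
print(build_layer_schedule(6))
print(build_layer_schedule(8, mamba_per_attention=3))

# Modulate a single 3-dimensional token with a fixed shift and scale.
print(adaln([1.0, 2.0, 3.0], shift=[0.5, 0.5, 0.5], scale=[1.0, 1.0, 1.0]))
```

Raising `mamba_per_attention` in this sketch replaces attention layers with Mamba layers, which is one plausible way to read the abstract's remark about adapting the design to different resource constraints.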

Paper Structure (24 sections, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Images generated by Dimba. The model produces highly aesthetic, natural, and consistent images that follow the user's textual instructions.
  • Figure 2: Model architecture of Dimba. Mamba and Transformer layers are interleaved in a stacked manner. Text features are incorporated through a cross-attention layer. Time information is projected by a shared MLP before being fed into the different AdaLN layers.
  • Figure 3: Training data illustration and histogram of caption lengths. (a) Auto-labeled captions provide accurate textual descriptions of the images; valid nouns and verbs are highlighted in red. (b) Histograms drawn from 1M captions randomly selected from the raw captions and from the re-labeled captions.
  • Figure 4: Qualitative comparison of Dimba with four other open-source text-to-image models. Baselines include Playground v2.5, PixArt, SDXL, and SDXL Turbo. Images generated by Dimba are competitive with these baselines and show richer detail and stronger aesthetics.
  • Figure 5: User study and AI preference on fixed prompts. The ratio values indicate the percentage of participants preferring Dimba over the corresponding baseline. Dimba shows superior capability in both image quality and prompt following.
  • ...and 1 more figures