LaMamba-Diff: Linear-Time High-Fidelity Diffusion Models Based on Local Attention and Mamba

Yunxiang Fu, Chaoqi Chen, Yizhou Yu

TL;DR

Local Attentional Mamba (LaMamba) blocks are introduced that combine the strengths of self-attention and Mamba, capturing both global contexts and local details with linear complexity, while achieving superior performance with comparable or fewer parameters.

Abstract

Recent Transformer-based diffusion models have shown remarkable performance, largely attributed to the ability of the self-attention mechanism to accurately capture both global and local contexts by computing all-pair interactions among input tokens. However, their quadratic complexity poses significant computational challenges for long-sequence inputs. Conversely, a recent state space model called Mamba offers linear complexity by compressing a filtered global context into a hidden state. Despite its efficiency, compression inevitably leads to information loss of fine-grained local dependencies among tokens, which are crucial for effective visual generative modeling. Motivated by these observations, we introduce Local Attentional Mamba (LaMamba) blocks that combine the strengths of self-attention and Mamba, capturing both global contexts and local details with linear complexity. Leveraging the efficient U-Net architecture, our model exhibits exceptional scalability and surpasses the performance of DiT across various model scales on ImageNet at 256x256 resolution, all while utilizing substantially fewer GFLOPs and a comparable number of parameters. Compared to state-of-the-art diffusion models on ImageNet 256x256 and 512x512, our largest model presents notable advantages, such as a reduction of up to 62% GFLOPs compared to DiT-XL/2, while achieving superior performance with comparable or fewer parameters. Our code is available at https://github.com/yunxiangfu2001/LaMamba-Diff.
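The complexity contrast drawn in the abstract can be illustrated with a small counting sketch. It is not the authors' implementation: the function names, the non-overlapping window partition, and the window size of 64 tokens are assumptions chosen for illustration only. It counts the query-key pairs scored by full self-attention, the pairs scored when attention is restricted to fixed-size local windows, and the steps of a single sequential state-space scan.

```python
def global_attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by full self-attention (all-pair interactions)."""
    return num_tokens * num_tokens


def local_attention_pairs(num_tokens: int, window_tokens: int) -> int:
    """Pairs scored when each token attends only inside its own window.

    Assumes non-overlapping windows of a fixed size (an illustrative choice,
    not a detail taken from the paper).
    """
    if window_tokens <= 0 or num_tokens % window_tokens != 0:
        raise ValueError("num_tokens must be a positive multiple of window_tokens")
    num_windows = num_tokens // window_tokens
    return num_windows * window_tokens * window_tokens


def scan_steps(num_tokens: int) -> int:
    """Steps of one sequential scan that folds each token into a hidden state."""
    return num_tokens


if __name__ == "__main__":
    window = 64  # hypothetical window size, in tokens
    for n in (256, 1024, 4096):
        print(n, global_attention_pairs(n), local_attention_pairs(n, window), scan_steps(n))
```

With the window size held fixed, doubling the token count doubles the local-attention and scan counts but quadruples the global-attention count, which is the sense in which a combination of local attention and a scan stays linear in sequence length.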

Paper Structure (13 sections, 3 equations, 14 figures, 7 tables)

Figures (14)

  • Figure 1: Unconditional image generation quality on ImageNet 256x256. The area of each bubble denotes GFLOPs. Left: FID-50K of LaMamba-Diff models trained for 400k iterations. Performance improves with the number of parameters and GFLOPs. Right: Our largest model outperforms state-of-the-art diffusion models with substantially fewer GFLOPs.
  • Figure 2: Network architecture of LaMamba-Diff. Left: Architecture of LaMamba-Diff-S. Right: Local attentional Mamba block.
  • Figure 3: ImageNet $256 \times 256$ samples generated by LaMamba-Diff-XL using a classifier-free guidance scale of 4.0. Class: Ice Cream
  • Figure 4: ImageNet $256 \times 256$ samples generated by LaMamba-Diff-XL using a classifier-free guidance scale of 4.0. Class: Tabby Cat
  • Figure 5: ImageNet $256 \times 256$ samples generated by LaMamba-Diff-XL using a classifier-free guidance scale of 2.0. Class: Lakeshore
  • ...and 9 more figures
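
The sample captions above quote classifier-free guidance scales of 4.0 and 2.0. As a rough sketch of what such a scale controls, the snippet below applies one commonly used formulation, in which the guided noise estimate is the unconditional estimate plus the scale times the difference between the conditional and unconditional estimates. The page does not state which formulation LaMamba-Diff uses, so this convention, the function name, and the toy inputs are all assumptions.

```python
from typing import List


def classifier_free_guidance(
    eps_uncond: List[float], eps_cond: List[float], scale: float
) -> List[float]:
    """Blend unconditional and conditional noise estimates.

    Assumed convention: guided = uncond + scale * (cond - uncond), so
    scale = 1.0 returns the conditional estimate and larger scales push
    further in the conditional direction.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("estimates must have the same length")
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    uncond = [0.0, 2.0]  # toy values, not model outputs
    cond = [1.0, 1.0]
    for s in (1.0, 2.0, 4.0):
        print(s, classifier_free_guidance(uncond, cond, s))
```

Under this convention a scale of 4.0 moves the estimate four times as far from the unconditional prediction as plain conditional sampling does, which typically trades sample diversity for stronger class adherence.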