Efficient 3D Shape Generation via Diffusion Mamba with Bidirectional SSMs

Shentong Mo

TL;DR

This work tackles the scalability bottlenecks of diffusion‑based 3D shape generation by introducing Diffusion Mamba for 3D (DiM‑3D), a framework that substitutes attention‑heavy diffusion transformers with linear‑complexity bidirectional State Space Models (Mamba). By patching voxelized point clouds into token sequences and applying DiM‑3D blocks, the model maintains $O(L)$ complexity while producing high‑fidelity, diverse 3D shapes and excelling in point‑cloud completion. Experiments on ShapeNet (chairs, airplanes, cars) show state‑of‑the‑art performance, and ablations demonstrate substantial reductions in Gflops and strong scalability across model sizes and datasets (e.g., ShapeNet‑55). The proposed approach sets a new standard for efficient, high‑resolution 3D generation, with broad implications for industrial design, VR/AR, and autonomous systems.
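The patchification step mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `patchify`, the voxel size of 32, the patch size of 4, and the 3 feature channels are all assumptions chosen for illustration.

```python
import numpy as np

def patchify(voxels: np.ndarray, p: int) -> np.ndarray:
    """Split a (V, V, V, C) voxel grid into non-overlapping p x p x p patches.

    Returns an array of shape (L, p*p*p*C), where L = (V // p) ** 3 is the
    number of tokens. Names and shapes here are illustrative assumptions.
    """
    V, _, _, C = voxels.shape
    assert voxels.shape[:3] == (V, V, V) and V % p == 0
    n = V // p                                   # patches per axis
    x = voxels.reshape(n, p, n, p, n, p, C)      # split each axis into (n, p)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)         # (n, n, n, p, p, p, C)
    return x.reshape(n ** 3, p ** 3 * C)         # one row per patch token

# Hypothetical input: a 32^3 grid with 3 feature channels per voxel.
voxels = np.zeros((32, 32, 32, 3), dtype=np.float32)
tokens = patchify(voxels, p=4)
print(tokens.shape)  # (512, 192)
```

Each row of `tokens` is one flattened patch; a learned linear projection would normally follow to produce the token embeddings that the sequence model consumes.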

Abstract

Recent advances in sequence modeling have produced the Mamba architecture, whose selective state space approach offers a promising route to efficient long-sequence modeling. However, its application to 3D shape generation, particularly at high resolutions, remains underexplored. Diffusion transformers (DiT) built on self-attention face scalability challenges because attention cost grows quadratically with input length, and the number of tokens itself grows cubically with voxel resolution. This becomes a significant hurdle at high-resolution voxel sizes. To address this challenge, we introduce Diffusion Mamba (DiM-3D), a novel diffusion architecture tailored for 3D point cloud generation. This architecture forgoes traditional attention mechanisms and instead uses the inherent efficiency of the Mamba architecture to maintain linear complexity with respect to sequence length. DiM-3D is characterized by fast inference and substantially lower computational demands, quantified in reduced Gflops, thereby addressing the key scalability issues of prior models. Our empirical results on the ShapeNet benchmark demonstrate that DiM-3D achieves state-of-the-art performance in generating high-fidelity and diverse 3D shapes. DiM-3D also shows superior capabilities in tasks such as 3D point cloud completion. These results demonstrate the model's scalability and its efficiency in generating the detailed, high-resolution voxels needed for advanced 3D shape modeling. Through these findings, we illustrate the scalability and efficiency of the Diffusion Mamba framework in 3D shape generation, setting a new standard for the field and paving the way for future work on high-resolution 3D modeling.
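To make the scaling argument concrete, the short sketch below counts tokens for a few voxel resolutions and compares the number of pairwise interactions in self-attention (proportional to L^2) with the number of steps in a single sequential scan (proportional to L). The patch size of 4 and the listed voxel sizes are assumptions for illustration, and constant factors such as hidden width are ignored.

```python
def num_tokens(voxel_size: int, patch_size: int) -> int:
    """Tokens produced by cutting a cubic grid into cubic patches."""
    return (voxel_size // patch_size) ** 3

# Assumed settings: patch size 4, voxel sizes 32, 64, 128.
for v in (32, 64, 128):
    L = num_tokens(v, 4)
    attention_pairs = L * L   # every token attends to every token
    scan_steps = L            # a recurrent scan visits each token once
    print(v, L, attention_pairs, scan_steps)

# 32  ->   512 tokens,       262144 pairs,   512 steps
# 64  ->  4096 tokens,     16777216 pairs,  4096 steps
# 128 -> 32768 tokens,   1073741824 pairs, 32768 steps
```

Doubling the voxel side length multiplies the token count by 8, the attention pair count by 64, and the scan step count by 8.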

Paper Structure

This paper contains 11 sections, 5 equations, 2 figures, 6 tables.

Introduction
Related Work
Method
  Preliminaries
  Diffusion Mamba for 3D Shape Generation
  DiM-3D Block with Linear Complexity
Experiments
  Experimental Setup
  Comparison to Prior Work
  Experimental Analysis
Conclusion

Figures (2)

Figure 1: Illustration of the proposed Diffusion Mamba framework for 3D shape generation (DiM-3D). The framework takes voxelized point clouds as input, and a patchification operator is used to generate token-level patch embeddings. Then, multiple DiM-3D blocks based on Mamba with bidirectional state space models (SSMs) extract representations from all input tokens. Finally, a linear layer and a devoxelization operator are used to predict the noise in the point cloud space.
Figure 2: Qualitative visualizations of generated 3D point clouds. Our DiM-3D achieves high-fidelity and diverse 3D point cloud generation across different categories.
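The bidirectional state space scan named in the Figure 1 caption can be sketched with a toy example. This is a deliberately simplified, hypothetical version: it uses fixed scalar parameters `a`, `b`, `c` rather than Mamba's input-dependent (selective) parameters, and it merges the two directions by summation, which is an assumption rather than something the source states.

```python
import numpy as np

def ssm_scan(x: np.ndarray, a: float, b: float, c: float) -> np.ndarray:
    """Run h_t = a * h_{t-1} + b * x_t and y_t = c * h_t along the token axis.

    x has shape (L, D). The loop touches each token once, so the cost is
    O(L * D), i.e. linear in sequence length L.
    """
    h = np.zeros(x.shape[1], dtype=x.dtype)
    y = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        y[t] = c * h
    return y

def bidirectional_ssm(x: np.ndarray, a: float = 0.9, b: float = 1.0,
                      c: float = 1.0) -> np.ndarray:
    """Scan the sequence in both directions and sum the results (assumed merge)."""
    fwd = ssm_scan(x, a, b, c)
    bwd = ssm_scan(x[::-1], a, b, c)[::-1]   # scan reversed, then restore order
    return fwd + bwd

tokens = np.ones((4, 2))
out = bidirectional_ssm(tokens, a=0.5)
print(out[:, 0])  # values: 2.875, 3.25, 3.25, 2.875
```

Because each output position combines a forward state and a backward state, every token receives information from the whole sequence, which matters for voxel patches that have no natural left-to-right order.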
