Gamba: Marry Gaussian Splatting with Mamba for single view 3D reconstruction

Qiuhong Shen, Zike Wu, Xuanyu Yi, Pan Zhou, Hanwang Zhang, Shuicheng Yan, Xinchao Wang

TL;DR

Gamba is an end-to-end, feed-forward pipeline for single-view 3D reconstruction that combines 3D Gaussian Splatting (3DGS) with the Mamba architecture to achieve millisecond-scale inference. Its backbone, GambaFormer, treats 3DGS reconstruction as sequential prediction whose cost scales linearly with token length, so it can handle a large number of Gaussians. A radial mask constraint, derived from multi-view masks, removes the need for warmup supervision with 3D point clouds during training. Trained on Objaverse and evaluated on the GSO dataset, Gamba is competitive with prior optimization-based and feed-forward methods in quality while reconstructing an object within 0.05 seconds on a single NVIDIA A100 GPU, about $1,000\times$ faster than optimization-based approaches.

Abstract

We tackle the challenge of efficiently reconstructing a 3D asset from a single image at millisecond speed. Existing methods for single-image 3D reconstruction are primarily based on Score Distillation Sampling (SDS) with Neural 3D representations. Despite promising results, these approaches encounter practical limitations due to lengthy optimizations and significant memory consumption. In this work, we introduce Gamba, an end-to-end 3D reconstruction model from a single-view image, emphasizing two main insights: (1) Efficient Backbone Design: introducing a Mamba-based GambaFormer network to model 3D Gaussian Splatting (3DGS) reconstruction as sequential prediction with linear scalability of token length, thereby accommodating a substantial number of Gaussians; (2) Robust Gaussian Constraints: deriving radial mask constraints from multi-view masks to eliminate the need for warmup supervision of 3D point clouds in training. We trained Gamba on Objaverse and assessed it against existing optimization-based and feed-forward 3D reconstruction approaches on the GSO Dataset, among which Gamba is the only end-to-end trained single-view reconstruction model with 3DGS. Experimental results demonstrate its competitive generation capabilities both qualitatively and quantitatively and highlight its remarkable speed: Gamba completes reconstruction within 0.05 seconds on a single NVIDIA A100 GPU, which is about $1,000\times$ faster than optimization-based methods. Please see our project page at https://florinshen.github.io/gamba-project.
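
The "linear scalability of token length" mentioned in the abstract can be illustrated with a toy state-space recurrence. The sketch below is an assumption-laden illustration, not the GambaFormer or the selective mechanism used in Mamba: it runs a plain diagonal linear recurrence `h_t = a * h_{t-1} + b * x_t`, `y_t = c · h_t` over a token sequence. The function name `diagonal_ssm_scan` and the parameters `a`, `b`, `c` are hypothetical. Each token is visited once and updates a fixed-size state, so the cost grows linearly with the number of tokens, whereas full self-attention compares every pair of tokens.

```python
import numpy as np

def diagonal_ssm_scan(x, a, b, c):
    """Toy diagonal state-space scan (illustrative only, not Mamba itself).

    x: sequence of scalar inputs, length T
    a, b, c: 1-D arrays of the same length (the hidden state size)
    Returns y with y[t] = c . h_t, where h_t = a * h_{t-1} + b * x[t].
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    c = np.asarray(c, dtype=float)
    h = np.zeros_like(a)          # hidden state, fixed size regardless of T
    y = np.empty(len(x), dtype=float)
    for t, x_t in enumerate(x):   # one pass over the tokens: O(T) steps
        h = a * h + b * x_t       # elementwise state update
        y[t] = float(c @ h)       # linear readout of the state
    return y

# Impulse input: the output decays geometrically with factor a.
y = diagonal_ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[2.0])
```

Doubling the sequence length doubles the number of loop iterations while the state size stays constant, which is the property that lets a sequential model accommodate many Gaussian tokens.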

Paper Structure (13 sections, 8 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: (a): We propose Gamba, an end-to-end, feed-forward single-view reconstruction pipeline, which marries 3D Gaussian Splatting with Mamba to achieve fast reconstruction. (b): The relationship between the 3DGS iterative reconstruction and the Gamba sequential prediction pattern.
  • Figure 2: Overall architecture of Gamba. Gamba takes a single view image and its camera pose as input to predict the 3D Gaussian Splatting of the given subject. Training supervision is only applied on the rendered multi-view images through reconstruction loss.
  • Figure 3: Radial polygon mask. Object masks are divided into polygon masks by 2D ray casting from the image center to the contours.
  • Figure 4: Qualitative Comparison with large reconstruction models.
  • Figure 5: Comparison with single-view 3D reconstruction methods based on Zero-1-to-3 [liu2023zero], including the feed-forward-only method One-2-3-45 [liu2023one] and the optimization-based DreamGaussian [tang2023dreamgaussian].
  • ...and 3 more figures
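
The 2D ray casting described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `radial_contour_distances`, the parameter `num_rays`, and the choice of the farthest foreground pixel along each ray as the contour point are all hypothetical. The sketch casts evenly spaced rays from the image center and records, for each ray, the distance to the outermost foreground pixel of a binary object mask. Connecting consecutive contour points with the center would give the polygon sectors. How these sectors are turned into a constraint on the predicted Gaussians is not specified in this summary.

```python
import numpy as np

def radial_contour_distances(mask, num_rays=32):
    """Cast rays from the image center and find the mask contour along each.

    mask: 2-D array, nonzero where the object is (hypothetical input format).
    Returns (angles, dist): ray angles in radians and, per ray, the radius in
    pixels of the farthest foreground pixel (0 if the ray hits no foreground).
    """
    h, w = mask.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    max_r = int(np.ceil(np.hypot(h, w) / 2.0))
    angles = np.linspace(0.0, 2.0 * np.pi, num_rays, endpoint=False)
    radii = np.arange(max_r + 1)

    # Sample integer pixel coordinates along every ray: (num_rays, max_r + 1).
    ys = np.rint(cy + np.outer(np.sin(angles), radii)).astype(int)
    xs = np.rint(cx + np.outer(np.cos(angles), radii)).astype(int)

    # Ignore samples that fall outside the image.
    inside = (ys >= 0) & (ys < h) & (xs >= 0) & (xs < w)
    hit = np.zeros_like(inside)
    hit[inside] = mask[ys[inside], xs[inside]] > 0

    # Farthest foreground radius per ray; rays with no hit stay at 0.
    dist = (hit * radii).max(axis=1)
    return angles, dist

# Usage: a filled disk of radius 10 centered in a 65x65 image.
yy, xx = np.mgrid[:65, :65]
disk = ((yy - 32) ** 2 + (xx - 32) ** 2 <= 100).astype(np.uint8)
angles, dist = radial_contour_distances(disk, num_rays=16)
```

For the disk above, every ray should report a distance close to the disk radius, up to the rounding introduced by sampling integer pixels.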