MVGamba: Unify 3D Content Generation as State Space Sequence Modeling

Xuanyu Yi, Zike Wu, Qiuhong Shen, Qingshan Xu, Pan Zhou, Joo-Hwee Lim, Shuicheng Yan, Xinchao Wang, Hanwang Zhang

TL;DR

MVGamba presents a general, lightweight Gaussian reconstruction model that unifies 3D content generation from images or text by leveraging a causal state-space sequence model (Mamba) to expand multi-view inputs into long Gaussian token sequences. The cross-view self-refinement enabled by the SSM-based reconstructor maintains multi-view consistency with linear complexity, and RotNet provides differentiable, stable rotation prediction for Gaussians. The approach achieves state-of-the-art results on image-to-3D, text-to-3D, and sparse-view reconstruction while using only about 0.1× the parameters of leading baselines, enabling sub-second generation when combined with standard multi-view diffusion models. This has practical implications for fast, unified 3D content creation in applications ranging from VR to gaming and animation, with potential extensions to scene and dynamic 3D generation.

Abstract

Recent 3D large reconstruction models (LRMs) can generate high-quality 3D content in under a second by integrating multi-view diffusion models with scalable multi-view reconstructors. Current works further leverage 3D Gaussian Splatting as the 3D representation for improved visual quality and rendering efficiency. However, we observe that existing Gaussian reconstruction models often suffer from multi-view inconsistency and blurred textures. We attribute this to compromising multi-view information propagation in order to adopt powerful yet computationally intensive architectures (e.g., Transformers). To address this issue, we introduce MVGamba, a general and lightweight Gaussian reconstruction model featuring a multi-view Gaussian reconstructor based on the RNN-like State Space Model (SSM). Our Gaussian reconstructor propagates causal context containing multi-view information for cross-view self-refinement while generating a long sequence of Gaussians for fine-detail modeling with linear complexity. With off-the-shelf multi-view diffusion models integrated, MVGamba unifies 3D generation tasks from a single image, sparse images, or text prompts. Extensive experiments demonstrate that MVGamba outperforms state-of-the-art baselines in all 3D content generation scenarios with only about $0.1\times$ the model size.
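To make the "causal context" and "linear complexity" claims concrete, the following is a minimal sketch of a generic discrete linear state space recurrence, not the MVGamba or Mamba implementation. The function name `ssm_scan`, the fixed matrices `A`, `B`, `C`, and all shapes are assumptions chosen for illustration; Mamba-style models additionally make these parameters input-dependent, which this sketch omits.

```python
import numpy as np


def ssm_scan(x, A, B, C):
    """Run h_t = A h_{t-1} + B x_t, y_t = C h_t over a token sequence.

    Hypothetical shapes: x (L, d_in), A (n, n), B (n, d_in), C (d_out, n).
    The loop visits each token once, so cost grows linearly with L.
    """
    seq_len = x.shape[0]
    h = np.zeros(A.shape[0])              # hidden state carries past context
    ys = np.empty((seq_len, C.shape[0]))
    for t in range(seq_len):
        h = A @ h + B @ x[t]              # update state with the current token
        ys[t] = C @ h                     # output depends only on tokens <= t
    return ys


# Illustrative usage with made-up sizes.
rng = np.random.default_rng(0)
seq_len, d_in, n, d_out = 8, 4, 3, 2
A = 0.9 * np.eye(n)                       # simple stable state transition
B = rng.normal(size=(n, d_in))
C = rng.normal(size=(d_out, n))
tokens = rng.normal(size=(seq_len, d_in))
outputs = ssm_scan(tokens, A, B, C)
print(outputs.shape)
```

Because each output is computed from the running state alone, later tokens can draw on everything seen earlier in the sequence, while earlier outputs are unaffected by later tokens. Self-attention, by contrast, compares every token pair, so its cost grows quadratically with sequence length.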

Paper Structure

This paper contains 30 sections, 5 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: MVGamba is a unified 3D generation framework built on Gaussian Splatting, which can generate high-quality 3D content in a feed-forward manner in under a second.
  • Figure 2: (a) Previous Gaussian reconstruction models sacrifice the integrity of multi-view information for computationally intensive architectures, resulting in multi-view inconsistency and blurred textures. (b) Comparison of FLOPs between self-attention in Transformers and SSM in Mamba. Detailed FLOPs data are provided in Table \ref{tab:compare}.
  • Figure 3: (a) Multi-view Gaussian reconstructor (Sec. \ref{sec:mvgamba_arch}): Multi-view inputs with ray embedding are used for causal sequence modeling, predicting Gaussians rendered at novel views and supervised with ground truth images. (b) Unified inference pipeline (Sec. \ref{sec:infer}): MVGamba combines multi-view diffusion models and Gaussian reconstructor to generate high-quality 3D content in sub-seconds.
  • Figure 4: Qualitative comparison in image-to-3D and text-to-3D generation. Please refer to Appendix \ref{app:result} for more generation results.
  • Figure 5: Qualitative results in sparse-view reconstruction. Given four views as input, MVGamba effectively reconstructs both the geometric structure and detailed textures.
  • ...and 9 more figures