MVGaussian: High-Fidelity text-to-3D Content Generation with Multi-View Guidance and Surface Densification

Phu Pham, Aradhya N. Mathur, Ojaswa Sharma, Aniket Bera

TL;DR

MVGaussian tackles the Janus ambiguity and heavy training costs in text-to-3D by uniting Score Distillation Sampling with explicit 3D Gaussian splatting, guided by multi-view imagery and depth-based surface densification. The method introduces a Gaussian alignment regularizer and an uncertainty-weighted loss to flatten and colocate gaussians with the surface, while backprojected depth drives progressive densification and pruning. Empirical results show photorealistic 3D outputs with significantly fewer gaussians and far shorter training times (around 25 minutes on an A100) compared to prior 3D diffusion approaches, supported by human evaluation. The work demonstrates robust, fast text-to-3D generation across diverse prompts and offers practical improvements for rapid prototyping in content creation and related industries.

Abstract

The field of text-to-3D content generation has made significant progress in generating realistic 3D objects, with existing methodologies like Score Distillation Sampling (SDS) offering promising guidance. However, these methods often encounter the "Janus" problem (multi-face ambiguities) due to imprecise guidance. Additionally, while recent advancements in 3D Gaussian splatting have shown its efficacy in representing 3D volumes, optimization of this representation remains largely unexplored. This paper introduces a unified framework for text-to-3D content generation that addresses these critical gaps. Our approach utilizes multi-view guidance to iteratively form the structure of the 3D model, progressively enhancing detail and accuracy. We also introduce a novel densification algorithm that aligns gaussians close to the surface, optimizing the structural integrity and fidelity of the generated models. Extensive experiments validate our approach, demonstrating that it produces high-quality visual outputs with minimal time cost. Notably, our method achieves high-quality results within half an hour of training, offering a substantial efficiency gain over most existing methods, which require hours of training time to achieve comparable results.

Paper Structure (24 sections, 19 equations, 10 figures, 2 tables, 1 algorithm)

Figures (10)

  • Figure 1: Overview of our MVGaussian framework: Our approach begins with the random initialization of gaussians within a unit sphere, refined iteratively using an SDS-based optimization strategy. Gaussians are optimized near the true surface, moving toward the pseudo surface while those farther away are pruned. Each iteration renders four views with random azimuth angles, which are encoded into the latent space. Gaussian noise is added and denoised using a U-Net model to compute the loss $\mathcal{L}_{sds}$. The optimization gradient $\nabla \mathcal{L}_{sds}$ updates the gaussians, incorporating a feedback loop with fused point cloud data and voxel downsampling to enhance accuracy.
  • Figure 2: Extensive qualitative results and comparisons against several state-of-the-art methods. Our method shows consistent improvement across all tested prompts, demonstrating the effectiveness of our densification approach.
  • Figure 3: Additional qualitative comparisons with several state-of-the-art methods.
  • Figure 4: Evaluation of various aspects of the generated 3D content across different text-to-3D models based on human assessments.
  • Figure 5: Images with and without additional losses
  • ...and 5 more figures
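
Two of the geometric steps named in the Figure 1 caption, random initialization inside a unit sphere and voxel downsampling of a point cloud, can be illustrated with a small standard-library sketch. This is not the authors' implementation: the function names, the rejection-sampling scheme, the independently drawn azimuths, the voxel size, and the use of the initial centers as a stand-in for the fused, depth-backprojected point cloud are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def sample_in_unit_sphere(n, rng):
    """Rejection-sample n points uniformly inside the unit sphere (assumed scheme)."""
    points = []
    while len(points) < n:
        p = (rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1))
        if p[0] ** 2 + p[1] ** 2 + p[2] ** 2 <= 1.0:
            points.append(p)
    return points


def random_azimuths(num_views, rng):
    """Draw one azimuth in degrees per view; independent draws are an assumption."""
    return [rng.uniform(0.0, 360.0) for _ in range(num_views)]


def voxel_downsample(points, voxel_size):
    """Replace all points that fall into the same voxel with their centroid."""
    buckets = defaultdict(list)
    for p in points:
        key = tuple(math.floor(c / voxel_size) for c in p)
        buckets[key].append(p)
    return [
        tuple(sum(axis) / len(group) for axis in zip(*group))
        for group in buckets.values()
    ]


rng = random.Random(0)
centers = sample_in_unit_sphere(1000, rng)   # hypothetical initial gaussian centers
azimuths = random_azimuths(4, rng)           # four views per iteration, as in Figure 1
# Stand-in only: in the paper the downsampled cloud comes from fused depth, not the centers.
sparse = voxel_downsample(centers, voxel_size=0.25)
print(len(centers), len(azimuths), len(sparse))
```

Voxel downsampling caps the point density per unit volume, which is one plausible reason to apply it before using a fused point cloud as a surface proxy; the caption does not state the voxel size or how the result feeds back into densification and pruning.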