High-fidelity 3D Object Generation from Single Image with RGBN-Volume Gaussian Reconstruction Model

Yiyang Shen, Kun Zhou, He Wang, Yin Yang, Tianjia Shao

TL;DR

GS-RGBN targets high-fidelity single-view 3D reconstruction with an RGBN-volume Gaussian framework that couples a 3D-native hybrid Voxel-Gaussian representation with a cross-volume fusion (CVF) module for fusing RGB and surface-normal cues. By decoding per-voxel 2D Gaussians from the fused RGBN volume and supervising with color, depth, and regularization losses, the method achieves view-consistent geometry and fast rendering. Key contributions are the structured 3D voxel grid that organizes otherwise unstructured Gaussians, the CVF module for RGB–normal fusion, and ablations showing that both the 3D structure and the multimodal fusion are necessary. Results on the GSO dataset show better novel-view synthesis and single-view reconstruction quality than prior methods at practical runtime, which suggests the approach suits rapid 3D asset creation from a single image.

Abstract

Single-view 3D generation via Gaussian splatting has recently emerged and developed quickly. These methods learn 3D Gaussians from 2D RGB images generated by pre-trained multi-view diffusion (MVD) models, and have shown a promising avenue for 3D generation from a single image. Despite this progress, they still suffer from inconsistency caused jointly by the geometric ambiguity of 2D images and the lack of structure in 3D Gaussians, which leads to distorted and blurry 3D objects. In this paper, we address these issues with GS-RGBN, a new RGBN-volume Gaussian Reconstruction Model designed to generate high-fidelity 3D objects from single-view images. Our key insight is that a structured 3D representation can mitigate both issues simultaneously. To this end, we propose a novel hybrid Voxel-Gaussian representation, in which a 3D voxel representation contains explicit 3D geometric information, eliminating the geometric ambiguity of 2D images. It also structures the Gaussians during learning so that the optimization tends to find better local optima. The 3D voxel representation is obtained by a fusion module that aligns RGB features and surface normal features, both of which can be estimated from 2D images. Extensive experiments demonstrate the superiority of our method over prior work in reconstruction quality, generalization, and efficiency.
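
The abstract states that the voxel representation structures the Gaussians during learning but does not give the mechanism. One way to picture this, as a minimal sketch and purely as an assumption, is to anchor each Gaussian to a voxel center and let the network predict only a bounded offset inside that voxel. The function names `voxel_centers` and `anchored_center`, the `tanh` bounding, and the cube extent are all hypothetical and are not taken from the paper.

```python
import math

def voxel_centers(resolution, extent=1.0):
    """Return the centers of a resolution^3 grid covering [-extent, extent]^3,
    together with the edge length of one voxel. (Hypothetical helper.)"""
    size = 2.0 * extent / resolution
    axis = [-extent + (i + 0.5) * size for i in range(resolution)]
    centers = [(x, y, z) for x in axis for y in axis for z in axis]
    return centers, size

def anchored_center(center, raw_offset, voxel_size):
    """Map an unbounded raw offset to a Gaussian center that stays inside
    the voxel: tanh squashes each component to (-1, 1), which is then
    scaled by half the voxel edge. (Assumed bounding scheme.)"""
    return tuple(
        c + 0.5 * voxel_size * math.tanh(o)
        for c, o in zip(center, raw_offset)
    )

# Usage: an 8-voxel grid, with one Gaussian placed inside the last voxel.
centers, size = voxel_centers(2)
position = anchored_center(centers[-1], (0.3, -1.2, 0.0), size)
```

Under this assumption, every Gaussian is tied to a fixed region of space, so no primitive can drift arbitrarily far during optimization. That is the kind of structure an unanchored set of Gaussians lacks.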

Paper Structure

This paper contains 14 sections, 9 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: GS-RGBN is an RGBN-volume Gaussian reconstruction model that generates high-quality 2D Gaussians (middle) from a single image (left). Textured meshes can optionally be reconstructed from the generated 2D Gaussians (right).
  • Figure 2: Overview of our paradigm. Given a single image of a 3D object, we first feed it into an off-the-shelf multi-view diffusion model (Wonder3D [long2024wonder3d]) to obtain two sets of multi-view images, one of normal maps and one of RGB images, which are used to build the hybrid Voxel-Gaussian model. Specifically, we feed these images to pre-trained ViT DINO models [caron2021emerging] and lift the extracted 2D DINO features to build two 3D feature volumes, i.e., RGB feature volume $V_{rgb}$ and normal feature volume $V_{nor}$ modulated by Plücker rays (Sec. \ref{SecM.1}). Next, a feature-level cross-volume fusion (CVF) module fuses the RGB and normal volumetric features into the fine-grained fused RGBN feature volume $V_{rgbn}$ (Sec. \ref{SecM.2}). Finally, several MLPs decode $V_{rgbn}$ to regress 2D Gaussian primitives for novel view rendering (Sec. \ref{SecM.3}). Training is supervised by color, depth, and regularization loss functions (Sec. \ref{SecM.4}).
  • Figure 3: Structure of the cross-volume fusion (CVF) module.
  • Figure 4: Qualitative comparisons of novel view synthesis between GS-RGBN and other methods on the GSO dataset. The 3D objects reconstructed by our method show high-quality, consistent details.
  • Figure 5: Qualitative comparisons of single view reconstruction between GS-RGBN and other methods on the GSO dataset.
  • ...and 1 more figure
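
The Figure 2 caption mentions feature volumes modulated by Plücker rays without spelling out the ray encoding. As a minimal sketch, assuming the standard Plücker parameterization of a ray with origin $o$ and unit direction $d$ as the 6-vector $(d, o \times d)$, a per-pixel ray embedding for a pinhole camera could be computed as follows. The function names, the intrinsics arguments, and the convention that the camera looks along its +z axis are illustrative assumptions, not details from the paper.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def plucker_ray(origin, direction):
    """Return the 6D Plücker coordinates (d, o x d) of a ray.
    The moment o x d does not change when the origin slides along the ray."""
    d = normalize(direction)
    return d + cross(origin, d)

def pixel_ray(u, v, fx, fy, cx, cy, cam_to_world_rot, cam_origin):
    """Plücker ray through pixel (u, v) of a pinhole camera.
    Assumes the camera looks along +z in its own frame (hypothetical convention);
    cam_to_world_rot is a 3x3 rotation given as nested sequences."""
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    d_world = tuple(
        sum(cam_to_world_rot[i][j] * d_cam[j] for j in range(3))
        for i in range(3)
    )
    return plucker_ray(cam_origin, d_world)

# Usage: ray through the principal point of a camera placed at z = -2.
identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
ray = pixel_ray(64.0, 64.0, 100.0, 100.0, 64.0, 64.0, identity, (0.0, 0.0, -2.0))
```

Because the moment term is the same for every point on the ray, the encoding identifies the ray itself rather than a particular sample along it, which makes it a convenient per-view geometric signal to attach to lifted image features.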