Enhancing Single Image to 3D Generation using Gaussian Splatting and Hybrid Diffusion Priors

Hritam Basak, Hadi Tabatabaee, Shreekant Gayaka, Ming-Feng Li, Xin Yang, Cheng-Hao Kuo, Arnie Sen, Min Sun, Zhaozheng Yin

TL;DR

This work introduces a two-stage frequency-based distillation loss integrated with Gaussian Splatting, leveraging geometric priors from a 3D diffusion model’s low-frequency spectrum for structural consistency and a 2D diffusion model’s high-frequency details for sharper textures to achieve state-of-the-art 3D reconstruction quality.

Abstract

3D object generation from a single image involves estimating the full 3D geometry and texture of unseen views from an unposed RGB image captured in the wild. Accurately reconstructing an object's complete 3D structure and texture has numerous applications in real-world scenarios, including robotic manipulation, grasping, 3D scene understanding, and AR/VR. Recent advancements in 3D object generation have introduced techniques that reconstruct an object's 3D shape and texture by optimizing the efficient representation of Gaussian Splatting, guided by pre-trained 2D or 3D diffusion models. However, a notable disparity exists between the training datasets of these models, leading to distinct differences in their outputs. While 2D models generate highly detailed visuals, they lack cross-view consistency in geometry and texture. In contrast, 3D models ensure consistency across different views but often result in overly smooth textures. We propose bridging the gap between 2D and 3D diffusion models to address this limitation by integrating a two-stage frequency-based distillation loss with Gaussian Splatting. Specifically, we leverage geometric priors in the low-frequency spectrum from a 3D diffusion model to maintain consistent geometry and use a 2D diffusion model to refine the fidelity and texture in the high-frequency spectrum of the generated 3D structure, resulting in more detailed and fine-grained outcomes. Our approach enhances geometric consistency and visual quality, outperforming the current SOTA. Additionally, we demonstrate the easy adaptability of our method for efficient object pose estimation and tracking.
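The frequency split described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the radial cutoff, the hard binary mask, the function names `split_frequency_bands` and `hybrid_band_loss`, and the use of a plain mean-squared error in place of a score distillation loss are all assumptions made for illustration. It only shows how one image can be separated into a low-frequency part (which would be supervised by the 3D prior) and a high-frequency part (which would be supervised by the 2D prior).

```python
import numpy as np


def split_frequency_bands(image, cutoff=0.1):
    """Split a 2D grayscale image into low- and high-frequency parts.

    `cutoff` is a hypothetical normalized radius in cycles per pixel;
    frequencies at or below it are treated as "low".
    """
    h, w = image.shape
    fy = np.fft.fftfreq(h)[:, None]   # vertical frequencies
    fx = np.fft.fftfreq(w)[None, :]   # horizontal frequencies
    radius = np.sqrt(fx ** 2 + fy ** 2)
    low_mask = radius <= cutoff       # hard radial low-pass mask (assumption)

    spectrum = np.fft.fft2(image)
    low = np.fft.ifft2(spectrum * low_mask).real
    high = np.fft.ifft2(spectrum * ~low_mask).real
    return low, high


def hybrid_band_loss(render, target_3d, target_2d, cutoff=0.1):
    """Toy stand-in for a frequency-split objective (not the paper's SDS loss).

    Low frequencies of the render are compared with a 3D-prior target,
    high frequencies with a 2D-prior target, using mean-squared error.
    """
    render_low, render_high = split_frequency_bands(render, cutoff)
    target_low, _ = split_frequency_bands(target_3d, cutoff)
    _, target_high = split_frequency_bands(target_2d, cutoff)
    low_term = np.mean((render_low - target_low) ** 2)
    high_term = np.mean((render_high - target_high) ** 2)
    return float(low_term + high_term)
```

Because the two masks are complementary, the low and high parts sum back to the original image, so neither band loses information that the other could not supply.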

Paper Structure

This paper contains 20 sections, 10 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 2: We improve single-image-to-3D generation, maintaining both geometric consistency and superior texture: the input image (col. 1), rendered multi-view images (cols. 2-5), and the generated normal map (col. 6). We encourage readers to check the supplementary file for video results.
  • Figure 3: Overall workflow of our proposed method: first, a reference image is passed through the novel view synthesis (NVS) pipeline Zero123++ [shi2023zero123++] (Sec. \ref{nvs_stage}), followed by a reconstruction stage that involves a coarse Gaussian initialization using LGM [tang2024lgm] and diffusion-prior-based optimization (Sec. \ref{reconstruction}).
  • Figure 4: Amplitude analysis of diffusion models with 2D and 3D priors: Stable Diffusion (SD) [rombach2022high] and Zero123 [liu2023zero]. The 3D-prior-based Zero123 produces blurry, smooth outputs whose amplitude spectrum is concentrated at low frequencies, whereas SD produces sharp outputs with stronger high-frequency components.
  • Figure 5: Qualitative comparison of our proposed method with InstantMesh [xu2024instantmesh], which consistently produces unrealistic outputs with a distorted face, asymmetric structure (panda), missing realism (bag), and artefacts (bedroom). We encourage readers to check the supplementary file for video results.
  • Figure 6: Qualitative ablation outcomes to assess the contribution of individual loss components. The columns represent the input image, output with only 2D prior ($L_{SDS}^{HF}$), output with only 3D prior ($L_{SDS}^{LF}$), and our proposed hybrid frequency-based SDS output, respectively.
  • ...and 1 more figure
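
The kind of amplitude comparison described for Figure 4 can be approximated with a simple spectral statistic. The sketch below is an assumption-laden illustration, not the analysis used in the paper: `high_frequency_ratio`, `box_blur`, the cutoff radius, and the use of random noise as a stand-in for a sharp image are all hypothetical. It measures what fraction of an image's non-DC spectral amplitude lies above a cutoff, and shows that blurring lowers that fraction.

```python
import numpy as np


def high_frequency_ratio(image, cutoff=0.1):
    """Fraction of non-DC spectral amplitude above a hypothetical cutoff radius."""
    h, w = image.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)

    amplitude = np.abs(np.fft.fft2(image))
    amplitude[0, 0] = 0.0             # ignore mean brightness (DC term)
    total = amplitude.sum()
    if total == 0:
        return 0.0
    return float(amplitude[radius > cutoff].sum() / total)


def box_blur(image, k=5):
    """Circular k x k box blur, used here only to mimic an over-smooth output."""
    out = np.zeros_like(image, dtype=float)
    half = k // 2
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            out += np.roll(image, (dy, dx), axis=(0, 1))
    return out / (k * k)


# Stand-in data: noise plays the role of a detailed image,
# its blurred copy the role of an overly smooth one.
rng = np.random.default_rng(0)
sharp = rng.random((64, 64))
smooth = box_blur(sharp)
sharp_ratio = high_frequency_ratio(sharp)
smooth_ratio = high_frequency_ratio(smooth)
```

In this toy setup `smooth_ratio` comes out lower than `sharp_ratio`, mirroring the qualitative contrast the caption draws between smooth 3D-prior outputs and sharp 2D-prior outputs.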