GSV3D: Gaussian Splatting-based Geometric Distillation with Stable Video Diffusion for Single-Image 3D Object Generation

Ye Tao, Jiawei Zhang, Yahao Shi, Dongqing Zou, Bin Zhou

TL;DR

This work tackles single-image 3D object generation by bridging 2D diffusion diversity with explicit 3D structural constraints. It introduces GSV3D, which adds a Gaussian Splatting Decoder to the SV3D multi-view diffusion framework and enforces 3D consistency via geometric distillation, using RGB and depth supervision and DINO-informed cross-attention. Training has two stages: the decoder is pretrained first, and its geometric supervision is then distilled into the diffusion model with LoRA. The result is robust, multi-view-consistent 3D reconstruction with high appearance fidelity. Experiments on Objaverse and Google Scanned Objects demonstrate state-of-the-art multi-view consistency and strong generalization, with substantial improvements attributed to the explicit 3D representation and the 3D loss terms. The approach offers a scalable path to high-quality 3D content from single images, with practical implications for robotics, gaming, and immersive visualization.

Abstract

Image-based 3D generation has vast applications in robotics and gaming, where high-quality, diverse outputs and consistent 3D representations are crucial. However, existing methods have limitations: 3D diffusion models are limited by dataset scarcity and the absence of strong pre-trained priors, while 2D diffusion-based approaches struggle with geometric consistency. We propose a method that leverages 2D diffusion models' implicit 3D reasoning ability while ensuring 3D consistency via Gaussian-splatting-based geometric distillation. Specifically, the proposed Gaussian Splatting Decoder enforces 3D consistency by transforming SV3D latent outputs into an explicit 3D representation. Unlike SV3D, which relies only on implicit 2D representations for video generation, Gaussian Splatting explicitly encodes spatial and appearance attributes, enabling multi-view consistency through geometric constraints. These constraints correct view inconsistencies, ensuring robust geometric consistency. As a result, our approach simultaneously generates high-quality, multi-view-consistent images and accurate 3D models, providing a scalable solution for single-image-based 3D generation and bridging the gap between 2D diffusion diversity and 3D structural coherence. Experimental results demonstrate state-of-the-art multi-view consistency and strong generalization across diverse datasets. The code will be made publicly available upon acceptance.

Paper Structure

This paper contains 16 sections, 7 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of GSV3D Training and Inference Pipeline. During inference, given an initialized noise latent $z_T$, an input image $\mathbf{R}$, and its corresponding camera pose $c$, the approach follows a structured pipeline. At each sampling step, the input image is encoded using the VAE encoder $\mathcal{E}$, and its latent representation is concatenated with the input noisy multi-view latents $z_t$. Simultaneously, the image $\mathbf{R}$ is encoded via CLIP and, along with the camera pose $c$, integrated into the diffusion process via cross-attention. In the last step, the multi-view denoised latents $\hat{z}_0$ are processed by the Gaussian Splatting Decoder to reconstruct a 3D representation $GS_0$, as shown in (a). During training, the geometric distillation process uses the Gaussian Splatting Decoder with 3D constraints to distill geometric knowledge from the multi-view latent representations $\hat{z}_t$ generated by the multi-view diffusion model at each diffusion step, which then serves as supervision $\mathcal{L}_{\text{3D}}$ to guide the learning of 3D geometry. At the same time, the multi-view latent loss $\mathcal{L}_{\text{2D}}$ is imposed in latent space, as shown in (b).
  • Figure 2: Overview of the Gaussian Splatting Decoder pipeline. During both geometric distillation and GSV3D inference, the conditioning image $\mathbf{R}$ is first processed by a pre-trained DINO encoder $\mathcal{DINO}$ to extract features, which are then integrated into the ViT via cross-attention. The ViT processes the latents $\hat{z}_t$ generated by GSV3D. The output of the ViT is passed through an upsampler before being converted into a 3D Gaussian Splatting representation. To train the Gaussian Splatting Decoder, the latents $\hat{z}_t$ are replaced with the latents $z_{input}$ encoded from the input multi-view images $I_{input}$ using a pre-trained VAE encoder $\mathcal{E}$. These input multi-view images consist of $N$ images, each corresponding to a viewpoint uniformly distributed around the object.
  • Figure 3: Performance comparison between our GSV3D and other state-of-the-art methods. GA and TGS are abbreviations for GaussianAnything and TriplaneGaussian, respectively. For each example, the first two columns display two different rendering views of the generated 3D representation, while the third column shows the rendered image of the extracted mesh.
  • Figure 4: Examples of using GSV3D for text-to-image-to-3D generation.
  • Figure 5: Visual comparison for the number of frames $N$ in the multi-view latents generated by denoising UNet $\epsilon_\theta$. Reducing the number of frames weakens generative capability and 3D constraints, leading to ghosting and blurry regions in the outputs of GSV3D.
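
The Figure 2 caption mentions $N$ input views placed uniformly around the object. As a rough illustration of what such a camera layout could look like, the sketch below places `n_views` cameras at evenly spaced azimuth angles on a circular orbit. The function name `orbit_cameras`, the orbit radius, the fixed elevation, and the z-up axis convention are all assumptions made for this example; the captions do not specify the paper's actual camera parameterization or the value of $N$.

```python
import numpy as np


def orbit_cameras(n_views, radius=2.0, elevation_deg=10.0):
    """Place n_views cameras at evenly spaced azimuths on a circular orbit.

    Hypothetical helper: radius, elevation, and the z-up convention are
    illustrative assumptions, not values taken from the paper.
    Returns (azimuths in degrees, camera positions of shape (n_views, 3)).
    """
    # Evenly spaced azimuths in [0, 360), e.g. 0, 90, 180, 270 for n_views=4.
    azimuths_deg = np.arange(n_views) * (360.0 / n_views)
    az = np.deg2rad(azimuths_deg)
    el = np.deg2rad(elevation_deg)

    # Spherical-to-Cartesian conversion with the object at the origin.
    x = radius * np.cos(el) * np.cos(az)
    y = radius * np.cos(el) * np.sin(az)
    z = np.full(n_views, radius * np.sin(el))
    return azimuths_deg, np.stack([x, y, z], axis=1)


if __name__ == "__main__":
    azimuths, positions = orbit_cameras(n_views=8)
    for a, p in zip(azimuths, positions):
        print(f"azimuth {a:6.1f} deg -> position {np.round(p, 3)}")
```

Every camera sits at the same distance from the origin, so only the azimuth changes from view to view. This is the simplest layout consistent with the caption's description of viewpoints spread evenly around the object.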