GSV3D: Gaussian Splatting-based Geometric Distillation with Stable Video Diffusion for Single-Image 3D Object Generation
Ye Tao, Jiawei Zhang, Yahao Shi, Dongqing Zou, Bin Zhou
TL;DR
This work tackles single-image 3D object generation by bridging the diversity of 2D diffusion models with explicit 3D structural constraints. It introduces GSV3D, which adds a Gaussian Splatting Decoder to the SV3D multi-view diffusion framework and enforces 3D consistency via geometric distillation, using RGB and depth supervision and DINO-informed cross-attention. Training proceeds in two stages: decoder pretraining, followed by LoRA-based distillation into the diffusion model. The result is robust, multi-view-consistent 3D reconstruction with high appearance fidelity. Experiments on Objaverse and Google Scanned Objects demonstrate state-of-the-art multi-view consistency and strong generalization, with substantial improvements attributed to the explicit 3D representation and the 3D loss terms. The approach offers a scalable path to high-quality 3D content from single images, with practical implications for robotics, gaming, and immersive visualization.
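To make the LoRA-based second stage more concrete, the following is a minimal sketch of the generic low-rank update that LoRA applies to a frozen weight matrix, W' = W + (alpha / r) * B A. It is not taken from the GSV3D implementation; the function names, matrix sizes, and the alpha value are illustrative assumptions, and plain Python lists are used only to keep the example dependency-free.

```python
# Generic LoRA weight update, shown on tiny matrices.
# Assumption: names, shapes, and alpha are illustrative, not from GSV3D.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    inner = len(b)
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def lora_effective_weight(w, lora_b, lora_a, alpha):
    """Return W + (alpha / r) * (B @ A).

    w      : frozen pretrained weight, shape (out, in)
    lora_b : trainable factor B, shape (out, r)
    lora_a : trainable factor A, shape (r, in)
    alpha  : scaling hyperparameter
    """
    rank = len(lora_a)              # r is the number of rows of A
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out, r) @ (r, in) -> (out, in)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


frozen_w = [[1.0, 0.0], [0.0, 1.0]]

# Standard LoRA initialization sets B to zero, so the adapted weight
# starts out identical to the frozen one.
initial = lora_effective_weight(frozen_w, [[0.0], [0.0]], [[0.3, 0.7]], alpha=2.0)

# After some (hypothetical) training, B is non-zero and the weight shifts.
adapted = lora_effective_weight(frozen_w, [[1.0], [2.0]], [[1.0, 0.5]], alpha=2.0)
```

Because B starts at zero in standard LoRA, the adapted layer initially reproduces the pretrained one, so fine-tuning begins from the original model's behavior, and only the small A and B factors are trained while W stays frozen.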
Abstract
Image-based 3D generation has vast applications in robotics and gaming, where high-quality, diverse outputs and consistent 3D representations are crucial. However, existing methods have limitations: 3D diffusion models are limited by dataset scarcity and the absence of strong pre-trained priors, while 2D diffusion-based approaches struggle with geometric consistency. We propose a method that leverages the implicit 3D reasoning ability of 2D diffusion models while ensuring 3D consistency via Gaussian-splatting-based geometric distillation. Specifically, the proposed Gaussian Splatting Decoder enforces 3D consistency by transforming SV3D latent outputs into an explicit 3D representation. Unlike SV3D, which relies only on implicit 2D representations for video generation, Gaussian Splatting explicitly encodes spatial and appearance attributes, enabling multi-view consistency through geometric constraints. These constraints correct inconsistencies across views, yielding robust geometric consistency. As a result, our approach simultaneously generates high-quality, multi-view-consistent images and accurate 3D models, providing a scalable solution for single-image 3D generation and bridging the gap between 2D diffusion diversity and 3D structural coherence. Experimental results demonstrate state-of-the-art multi-view consistency and strong generalization across diverse datasets. The code will be made publicly available upon acceptance.
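As a rough illustration of what explicitly encoding spatial and appearance attributes can look like, the sketch below models a single Gaussian primitive using the standard 3D Gaussian Splatting parameterization (center, per-axis scale, rotation quaternion, opacity, color) and builds its covariance as R S S^T R^T. The class and field names are hypothetical, and the exact attributes predicted by the GSV3D decoder are not specified in this abstract, so this should be read as a generic example rather than the authors' implementation.

```python
from dataclasses import dataclass
import math


@dataclass
class Gaussian3D:
    """One splat; field names are hypothetical, not taken from GSV3D."""
    center: tuple    # (x, y, z) position: spatial attribute
    scale: tuple     # per-axis standard deviations (sx, sy, sz)
    rotation: tuple  # quaternion (w, x, y, z); normalized before use
    opacity: float   # in [0, 1]: appearance attribute
    color: tuple     # RGB in [0, 1]: appearance attribute

    def rotation_matrix(self):
        """Convert the (normalized) quaternion to a 3x3 rotation matrix."""
        w, x, y, z = self.rotation
        norm = math.sqrt(w * w + x * x + y * y + z * z)
        w, x, y, z = w / norm, x / norm, y / norm, z / norm
        return [
            [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
            [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
            [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
        ]

    def covariance(self):
        """Covariance R S S^T R^T, where S = diag(scale)."""
        r = self.rotation_matrix()
        var = [s * s for s in self.scale]
        return [
            [sum(r[i][k] * var[k] * r[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)
        ]


# Axis-aligned splat: identity rotation, so the covariance is diag(scale^2).
aligned = Gaussian3D(center=(0.0, 0.0, 0.0), scale=(1.0, 2.0, 3.0),
                     rotation=(1.0, 0.0, 0.0, 0.0), opacity=1.0,
                     color=(1.0, 0.0, 0.0))

# Same splat rotated 90 degrees about the z axis: x and y variances swap.
half = math.sqrt(0.5)
rotated = Gaussian3D(center=(0.0, 0.0, 0.0), scale=(1.0, 2.0, 3.0),
                     rotation=(half, 0.0, 0.0, half), opacity=1.0,
                     color=(1.0, 0.0, 0.0))
```

Because the covariance is assembled from a rotation and squared scales, it is symmetric and positive semi-definite by construction, and the same explicit primitive can be rendered from any camera, which is what allows geometric constraints to be applied across views.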
