RealisticDreamer: Guidance Score Distillation for Few-shot Gaussian Splatting
Ruocheng Wu, Haolan He, Yufei Wang, Zhihao Li, Bihan Wen
TL;DR
This work tackles overfitting in few-shot 3D Gaussian Splatting by leveraging pretrained Video Diffusion Models as multi-view priors. It introduces Guidance Score Distillation (GSD), a training-free framework that corrects VDM noise predictions with a unified guidance mechanism, including Depth Warp Guidance and Semantic Feature Guidance, to align updates with accurate geometry and camera poses. By applying a DDIM-based distillation scheme to multiple frames and using semantic (DINO) features plus depth warping, GSD achieves improved view consistency and rendering quality across LLFF, Mip-NeRF360, and DTU datasets, outperforming prior diffusion-augmented approaches. The approach avoids fine-tuning the diffusion model while delivering robust few-shot 3D reconstructions, with ablations validating the contributions of semantic guidance, depth warping, and expanded camera trajectories.
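The depth warping mentioned above can be illustrated with standard pinhole geometry: back-project a pixel with its depth, move the 3D point into the neighboring camera's frame, and project it again. The sketch below is only an illustration of that idea, not the authors' implementation. The function name `warp_pixel`, the shared intrinsics (`fx`, `fy`, `cx`, `cy`), and the source-to-target pose convention (`R`, `t`) are all assumptions made for this example.

```python
def warp_pixel(u, v, depth, fx, fy, cx, cy, R, t):
    """Map pixel (u, v) with known depth from a source view into a target view.

    Hypothetical helper: R (3x3 nested list) and t (length-3 list) are assumed
    to take source-camera coordinates to target-camera coordinates, and both
    views are assumed to share the same pinhole intrinsics.
    """
    # Back-project the pixel to a 3D point in the source camera frame.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    z = depth

    # Apply the rigid transform into the target camera frame.
    p = [R[i][0] * x + R[i][1] * y + R[i][2] * z + t[i] for i in range(3)]

    # Points at or behind the target camera plane cannot be projected.
    if p[2] <= 0:
        return None

    # Project the 3D point onto the target image plane.
    return (fx * p[0] / p[2] + cx, fy * p[1] / p[2] + cy)


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # A small sideways camera shift moves the pixel horizontally.
    print(warp_pixel(10.0, 6.0, 2.0, 100.0, 100.0, 8.0, 6.0, identity, [0.1, 0.0, 0.0]))
```

Applying this mapping to every pixel of a rendered view gives a geometry-consistent estimate of how a neighboring view should look, which is the kind of reference a depth-based guidance term could compare against.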
Abstract
3D Gaussian Splatting (3DGS) has recently attracted considerable attention in 3D scene representation for its high-quality, real-time rendering capabilities. However, when the input comprises only sparse training views, 3DGS is prone to overfitting, primarily due to the lack of intermediate-view supervision. Inspired by the recent success of Video Diffusion Models (VDMs), we propose a framework called Guidance Score Distillation (GSD) to extract the rich multi-view consistency priors from pretrained VDMs. Building on insights from Score Distillation Sampling (SDS), GSD supervises rendered images from multiple neighboring views, guiding the Gaussian splatting representation towards the generative direction of the VDM. However, this generative direction often involves object motion and random camera trajectories, which makes it difficult to use as direct supervision during optimization. To address this problem, we introduce a unified guidance form to correct the noise prediction of the VDM. Specifically, we incorporate both a depth warp guidance based on real depth maps and a guidance based on semantic image features, ensuring that the score update direction from the VDM aligns with the correct camera pose and accurate geometry. Experimental results show that our method outperforms existing approaches across multiple datasets.
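To make the idea of correcting a noise prediction before distilling it more concrete, here is a minimal sketch on flat lists of pixel values. It is not the paper's formulation: the abstract does not give the exact guidance form, so the code assumes a classifier-guidance-style shift driven by a squared-error loss between the implied clean sample and a reference `target` (for example, a depth-warped view), with the noise prediction held fixed when differentiating. The names `predict_x0`, `guided_eps`, `distillation_residual`, `scale`, and `weight` are hypothetical.

```python
import math


def predict_x0(x_t, eps, alpha_bar):
    """DDIM-style estimate of the clean sample implied by a noise prediction."""
    s_a, s_1 = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - s_1 * e) / s_a for x, e in zip(x_t, eps)]


def guided_eps(x_t, eps_pred, target, alpha_bar, scale):
    """Shift eps_pred so the implied clean sample moves toward `target`.

    Assumed guidance loss: 0.5 * ||x0_hat - target||^2, differentiated with
    respect to x_t while treating eps_pred as a constant (a simplification).
    """
    s_a, s_1 = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x0_hat = predict_x0(x_t, eps_pred, alpha_bar)
    grad = [(a - b) / s_a for a, b in zip(x0_hat, target)]
    return [e + scale * s_1 * g for e, g in zip(eps_pred, grad)]


def distillation_residual(eps_corrected, noise, weight=1.0):
    """SDS-style update direction: weight * (corrected prediction - injected noise)."""
    return [weight * (e - n) for e, n in zip(eps_corrected, noise)]


if __name__ == "__main__":
    x_t = [1.0, -0.5]        # noisy rendered pixels (toy values)
    eps_pred = [0.2, 0.1]    # stand-in for a diffusion model's noise prediction
    target = [0.0, 0.3]      # stand-in for a guidance reference
    eps_hat = guided_eps(x_t, eps_pred, target, alpha_bar=0.5, scale=1.0)
    print(predict_x0(x_t, eps_hat, 0.5))
    print(distillation_residual(eps_hat, noise=[0.0, 0.0]))
```

Under these assumptions, the corrected prediction changes the implied clean sample by `scale * (1 - alpha_bar) / alpha_bar` times the gap to `target`, so the guidance strength controls how far the update direction is pulled toward the reference rather than toward whatever motion the diffusion model would otherwise generate.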
