Depth Estimation Based on 3D Gaussian Splatting Siamese Defocus
Jinchang Zhang, Ningning Xu, Hao Zhang, Guoyu Lu
TL;DR
This work tackles monocular depth estimation from defocus cues without requiring a multi-image focal stack at inference time. It introduces a self-supervised framework that couples a Siamese Defocus Network (SDNet), which predicts defocus maps, with a 3D Gaussian Splatting renderer. A camera-lens model guides the renderer to synthesize defocused images, and the blur reconstruction error against the real defocused images serves as supervision. DepthNet then refines the depth prediction using the learned defocus maps and an initial depth from the splatting stage, and is optimized with defocus, blur, and reconstruction losses. The approach achieves competitive or superior results on FoD500 and NYUv2 from a single defocused input, which makes it relevant to real-world settings where rapid focus adjustments are impractical.
Abstract
Depth estimation is a fundamental task in 3D geometry. While stereo depth estimation can be achieved through triangulation, monocular depth estimation is less straightforward because it requires integrating global and local information. Depth from Defocus (DFD) methods use camera lens models and parameters to recover depth information from blurred images and have been shown to perform well. However, these methods rely on All-In-Focus (AIF) images for depth estimation, which are nearly impossible to obtain in real-world applications. To address this issue, we propose a self-supervised framework based on 3D Gaussian splatting and Siamese networks. By learning the blur levels at different focal distances of the same scene in a focal stack, the framework predicts the defocus map and Circle of Confusion (CoC) from a single defocused image, and the defocus map is used as input to DepthNet for monocular depth estimation. The 3D Gaussian splatting model renders defocused images using the predicted CoC, and the differences between these renderings and the real defocused images provide additional supervision signals for the self-supervised Siamese Defocus network. The framework has been validated on both synthetic and real defocused datasets. Quantitative and qualitative experiments demonstrate that the proposed framework is effective as a DFD method.
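To make the link between depth and the Circle of Confusion concrete, the sketch below evaluates the standard thin-lens CoC formula, c = (f^2 / N) * |d - d_f| / (d * (d_f - f)), where f is the focal length, N the f-number, d_f the focus distance, and d the scene depth. This is an illustrative assumption: the abstract does not state which lens model or parameterization the framework uses, and the function name and example values here are hypothetical.

```python
import numpy as np

def circle_of_confusion(depth, focus_dist, focal_len, f_number):
    """Thin-lens CoC diameter (same length unit as the inputs).

    Assumed textbook model, not necessarily the paper's exact formulation.
    depth and focus_dist must be greater than focal_len.
    """
    depth = np.asarray(depth, dtype=np.float64)
    aperture = focal_len / f_number  # aperture diameter A = f / N
    return (aperture * focal_len * np.abs(depth - focus_dist)
            / (depth * (focus_dist - focal_len)))

# Hypothetical example: 50 mm lens at f/2 focused at 1 m (all units in metres).
depths = np.array([0.5, 1.0, 2.0, 4.0])
coc = circle_of_confusion(depths, focus_dist=1.0, focal_len=0.05, f_number=2.0)
```

The CoC is zero at the focus distance and grows as a point moves away from it on either side. Because two depths, one in front of and one behind the focal plane, can produce the same CoC, a defocus map constrains depth but does not always determine it on its own.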
