Depth-Consistent 3D Gaussian Splatting via Physical Defocus Modeling and Multi-View Geometric Supervision
Yu Deng, Baozhu Zhao, Junyan Su, Xiaohan Zhang, Qi Liu
TL;DR
This work tackles depth inconsistency in 3D reconstruction across extreme depth ranges by coupling physics-based depth-of-field supervision with cross-view geometric constraints in a 3D Gaussian Splatting framework. It introduces a differentiable defocus model with adaptive kernels, a depth-of-field loss, and gradient-aware density control to preserve near-field structure while improving far-field depth coherence. Global scale recovery for monocular depth, local grid-based depth restoration, and LoFTR-based feature matching together enforce metric and multi-view consistency, leading to state-of-the-art results on the Waymo Open Dataset and strong performance on unbounded scenes. The framework bridges optical imaging physics and learning-based depth regularization, offering a scalable and efficient approach to depth-aware urban scene reconstruction.
Abstract
Three-dimensional reconstruction in scenes with extreme depth variations remains challenging because supervisory signals are inconsistent between near-field and far-field regions. Existing methods fail to address inaccurate depth estimation in distant areas and structural degradation in close-range regions at the same time. This paper proposes a computational framework that integrates depth-of-field supervision and multi-view consistency supervision into 3D Gaussian Splatting. Our approach comprises two core components: (1) Depth-of-Field Supervision employs a scale-recovered monocular depth estimator (e.g., Metric3D) to generate depth priors, uses defocus convolution to synthesize physically accurate defocused images, and enforces geometric consistency through a novel depth-of-field loss, thereby enhancing depth fidelity in both far-field and near-field regions; (2) Multi-View Consistency Supervision employs LoFTR-based semi-dense feature matching to minimize cross-view geometric errors and enforces depth consistency via least-squares optimization over reliable matched points. By unifying defocus physics with multi-view geometric constraints, our method achieves superior depth fidelity, demonstrating a 0.8 dB PSNR improvement over the state-of-the-art method on the Waymo Open Dataset. This framework bridges physical imaging principles and learning-based depth regularization, offering a scalable solution for complex depth stratification in urban environments.
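To make the least-squares step concrete, the following is a minimal sketch of recovering a global scale and shift that align a monocular depth prediction with reference depths at a set of reliable points. The linear model `ref ≈ s * mono + t`, the function name `recover_scale_shift`, and the sample values are assumptions made for illustration; the abstract does not specify the exact formulation used in the paper.

```python
import numpy as np


def recover_scale_shift(mono_depth, ref_depth):
    """Fit ref_depth ≈ s * mono_depth + t in the least-squares sense.

    Hypothetical helper: assumes a global linear (scale + shift) relation
    between the monocular prediction and the reference depths.
    """
    mono = np.asarray(mono_depth, dtype=np.float64).ravel()
    ref = np.asarray(ref_depth, dtype=np.float64).ravel()
    if mono.shape != ref.shape:
        raise ValueError("mono_depth and ref_depth must have the same size")
    # Design matrix [mono, 1] so the unknowns are (s, t).
    A = np.stack([mono, np.ones_like(mono)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, ref, rcond=None)
    return float(s), float(t)


# Illustrative values only: relative depths at matched points and the
# corresponding reference depths (e.g. from triangulated matches).
mono = np.array([1.0, 2.0, 4.0, 8.0])
ref = 2.5 * mono + 0.3
scale, shift = recover_scale_shift(mono, ref)
aligned = scale * mono + shift  # monocular depth mapped to the reference scale
```

In practice the fit would be restricted to matches that pass a reliability check, since outliers can bias an unweighted least-squares estimate.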
