Depth-Consistent 3D Gaussian Splatting via Physical Defocus Modeling and Multi-View Geometric Supervision

Yu Deng, Baozhu Zhao, Junyan Su, Xiaohan Zhang, Qi Liu

TL;DR

This work tackles depth inconsistency in 3D reconstruction across extreme depth ranges by coupling physics-based depth-of-field supervision with cross-view geometric constraints in a 3D Gaussian Splatting framework. It introduces a differentiable defocus model with adaptive kernels, a depth-of-field loss, and gradient-aware density control to preserve near-field structure while improving far-field depth coherence. A global monocular depth scale recovery together with local grid-based depth restoration and LoFTR-based feature matching enforces metric and multi-view consistency, leading to state-of-the-art results on Waymo and strong performance on unbounded scenes. The framework meaningfully bridges optical imaging physics and learning-based depth regularization, offering a scalable approach for depth-aware urban scene reconstruction with practical efficiency.

Abstract

Three-dimensional reconstruction in scenes with extreme depth variations remains challenging due to inconsistent supervisory signals between near-field and far-field regions. Existing methods fail to simultaneously address inaccurate depth estimation in distant areas and structural degradation in close-range regions. This paper proposes a novel computational framework that integrates depth-of-field supervision and multi-view consistency supervision to advance 3D Gaussian Splatting. Our approach comprises two core components: (1) Depth-of-Field Supervision, which employs a scale-recovered monocular depth estimator (e.g., Metric3D) to generate depth priors, leverages defocus convolution to synthesize physically accurate defocused images, and enforces geometric consistency through a novel depth-of-field loss, thereby enhancing depth fidelity in both far-field and near-field regions; (2) Multi-View Consistency Supervision, which employs LoFTR-based semi-dense feature matching to minimize cross-view geometric errors and enforces depth consistency via least squares optimization of reliable matched points. By unifying defocus physics with multi-view geometric constraints, our method achieves superior depth fidelity, demonstrating a 0.8 dB PSNR improvement over the state-of-the-art method on the Waymo Open Dataset. This framework bridges physical imaging principles and learning-based depth regularization, offering a scalable solution for complex depth stratification in urban environments.
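As a rough illustration of what least-squares scale recovery for a monocular depth prior could look like, the sketch below fits one global scale factor between a monocular depth map and sparse reference depths. It is not the paper's algorithm: the function name `recover_scale`, the scale-only model (no shift term), and the assumption that reference depths come from reliably matched, triangulated points are all hypothetical choices made for this example.

```python
import numpy as np

def recover_scale(mono_depth, ref_depth, valid=None):
    """Closed-form least-squares scale s minimizing sum((s * mono - ref)^2).

    mono_depth: monocular depth predictions (any shape), up to an unknown scale.
    ref_depth:  reference metric depths at the same pixels (assumed here to
                come from reliable multi-view matches; hypothetical input).
    valid:      optional boolean mask selecting pixels that have a reference.
    """
    mono = np.asarray(mono_depth, dtype=np.float64).ravel()
    ref = np.asarray(ref_depth, dtype=np.float64).ravel()
    if valid is None:
        mask = np.ones(mono.shape, dtype=bool)
    else:
        mask = np.asarray(valid, dtype=bool).ravel()
    # Keep only finite samples with a positive reference depth.
    mask = mask & np.isfinite(mono) & np.isfinite(ref) & (ref > 0)
    m, r = mono[mask], ref[mask]
    denom = np.dot(m, m)
    if denom == 0.0:
        raise ValueError("no usable samples to estimate the scale")
    # Setting d/ds sum((s*m - r)^2) = 0 gives s = <m, r> / <m, m>.
    return float(np.dot(m, r) / denom)

# Usage: rescale the whole monocular map with the recovered factor.
rng = np.random.default_rng(0)
mono = rng.uniform(0.5, 2.0, size=(4, 6))
ref = 3.5 * mono                      # synthetic "metric" depths
scale = recover_scale(mono, ref)
metric_depth = scale * mono
```

A scale-only fit is the simplest choice. If the depth prior also carries an offset, the same normal-equation approach extends to a scale-and-shift model.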

Paper Structure

This paper contains 47 sections, 43 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: A schematic illustrating the principle of depth of field blur. When a scene point M at a distance d does not lie on the focus plane (at distance $d_f$), it creates a blurred spot on the image plane known as the circle of confusion (with a diameter of $D^{coc}$), causing the image to be out of focus. f represents the focal length of the lens.
  • Figure 2: Our framework consists of two core technical components: (a) Depth-of-Field Supervision (Blue Flow) addressing inaccuracies in distant scenes and difficulties in recovering structures in near-field scenes. The pipeline takes multi-view images as input, obtains scale-ambiguous depth predictions through a monocular depth estimator (e.g., Metric3D), and calculates true depth maps via a multi-view depth scale recovery algorithm. Defocus convolution is then utilized to generate defocused images from both rendered and ground truth images, with the final $\mathcal{L}_{\text{dof}}$ loss between these defocused images supervising the 3DGS training. (b) Multi-View Consistency Supervision (Orange Flow) resolving cross-view geometric alignment issues. Initially, semi-dense feature matching is performed across multi-view images using LoFTR, minimizing the error $\mathcal{L}_{\text{geo}}$ between 3D points corresponding to matched pixels to enhance cross-view geometric consistency. Simultaneously, a depth consistency loss $\mathcal{L}_{\text{depth}}$ employs local depth maps recovered through least squares optimization from accurately matched points with reliable depth information to optimize the depth rendered by 3DGS.
  • Figure 3: Illustration of polygonal aperture mechanisms in a camera lens: (a) octagon aperture blades and (b) dodecagon aperture blades.
  • Figure 4: Comparative analysis of defocus convolution techniques: (a) the original image (no blur), providing baseline sharpness; (b) a radially symmetric blur with a bell-shaped intensity profile, simulating natural defocus; (c) an edge-preserving blur with S-curve transitions based on hyperbolic tangent functions; (d) an aperture-emulating blur with geometric containment and radial attenuation for realistic bokeh effects.
  • Figure 5: Schematic Diagram of Geometric Consistency Loss Calculation: Feature points matched in two views are projected into 3D space using rendered depth, and geometric constraints are imposed by minimizing the distance between these projected points.
  • ...and 5 more figures
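The geometry described in the captions of Figures 1 and 5 can be sketched in a few lines of NumPy. The first function evaluates the standard thin-lens circle-of-confusion diameter, $D^{coc} = A\,f\,|d - d_f| / (d\,(d_f - f))$ with aperture diameter $A = f/N$. The captions do not state this formula, so using it, and the f-number $N$, is an assumption. The remaining functions back-project matched pixels with their rendered depths and average the 3D distance between them, as Figure 5 describes. The function names, the pinhole intrinsics `K`, and the camera-to-world pose convention are hypothetical choices for this example, not the paper's implementation.

```python
import numpy as np

def coc_diameter(d, d_f, f, n_stop):
    """Thin-lens circle-of-confusion diameter for a point at distance d.

    d_f is the focus distance, f the focal length, n_stop the f-number
    (all lengths in the same unit). The textbook thin-lens model is an
    assumption here.
    """
    d = np.asarray(d, dtype=np.float64)
    aperture = f / n_stop                      # aperture diameter A = f / N
    return aperture * f * np.abs(d - d_f) / (d * (d_f - f))

def backproject(pix, depth, K, c2w):
    """Lift pixels (N, 2) with depths (N,) to world-space points (N, 3)."""
    pix = np.asarray(pix, dtype=np.float64)
    depth = np.asarray(depth, dtype=np.float64)
    hom = np.hstack([pix, np.ones((pix.shape[0], 1))])
    rays = hom @ np.linalg.inv(K).T            # camera rays with z = 1
    pts_cam = rays * depth[:, None]            # scale rays by depth
    return pts_cam @ c2w[:3, :3].T + c2w[:3, 3]

def geo_consistency(pix_a, depth_a, c2w_a, pix_b, depth_b, c2w_b, K):
    """Mean 3D distance between back-projected matched pixels of two views."""
    pts_a = backproject(pix_a, depth_a, K, c2w_a)
    pts_b = backproject(pix_b, depth_b, K, c2w_b)
    return float(np.mean(np.linalg.norm(pts_a - pts_b, axis=1)))
```

In this sketch a point on the focus plane (`d == d_f`) has zero blur, and matched pixels whose depths agree with a single 3D point in both views give a loss of zero. In a training pipeline the same computation would be written in a differentiable framework so that gradients reach the rendered depth.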