Repurposing Marigold for Zero-Shot Metric Depth Estimation via Defocus Blur Cues

Chinmay Talegaonkar, Nikhil Gandudi Suresh, Zachary Novack, Yash Belhe, Priyanka Nagasamudra, Nicholas Antipa

TL;DR

Monocular metric depth estimation faces depth-scale ambiguity and poor zero-shot generalization. The paper repurposes a pre-trained diffusion prior, Marigold, by injecting defocus blur cues from two aperture images at inference and performing training-free optimization of metric depth scale and latent representations under a physics-based forward model. Formulated as an inverse problem with a defocus forward model, the approach achieves improved metric depth accuracy on a real, hardware-captured dataset while maintaining a strong generative prior. This training-free, physics-guided refinement widens the applicability of diffusion priors to metric depth tasks and offers a practical route to leverage depth cues without retraining.

Abstract

Recent monocular metric depth estimation (MMDE) methods have made notable progress towards zero-shot generalization. However, they still exhibit a significant performance drop on out-of-distribution datasets. We address this limitation by injecting defocus blur cues at inference time into Marigold, a pre-trained diffusion model for zero-shot, scale-invariant monocular depth estimation (MDE). Our method effectively turns Marigold into a metric depth predictor in a training-free manner. To incorporate defocus cues, we capture two images with a small and a large aperture from the same viewpoint. To recover metric depth, we then optimize the metric depth scaling parameters and the noise latents of Marigold at inference time using gradients from a loss function based on the defocus-blur image formation model. We compare our method against existing state-of-the-art zero-shot MMDE methods on a self-collected real dataset, showing quantitative and qualitative improvements.
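
To make the role of the two apertures concrete, the sketch below evaluates the standard thin-lens circle-of-confusion diameter, $c = f^2 |d - F| / (N d (F - f))$, at a small and a large aperture. This is the textbook thin-lens expression and may differ in notation from the paper's own forward model; the lens values (50 mm focal length, 1 m focus distance, $N=4$ for the wide aperture) and the helper name `coc_diameter` are assumptions chosen for illustration, not values from the paper.

```python
def coc_diameter(d, f, F, N):
    """Thin-lens circle-of-confusion diameter.

    d: object distance, f: focal length, F: focus distance (all in metres),
    N: F-stop. Returns the blur diameter on the sensor, in metres.
    """
    return (f * f) * abs(d - F) / (N * d * (F - f))


# Hypothetical setup, not taken from the paper: 50 mm lens focused at 1 m.
f, F = 0.050, 1.0

for d in (0.5, 1.0, 2.0, 4.0):
    small = coc_diameter(d, f, F, N=22)  # small aperture: close to all-in-focus
    large = coc_diameter(d, f, F, N=4)   # large aperture: depth-dependent blur
    print(f"d={d:.1f} m  c(N=22)={small * 1e3:.3f} mm  c(N=4)={large * 1e3:.3f} mm")
```

Because $c$ scales as $1/N$, the small-aperture capture has much less blur at every depth, while the blur in the large-aperture capture varies with metric distance. That depth dependence is the cue the optimization exploits.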

Paper Structure

This paper contains 33 sections, 10 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Method overview. We capture two images (same viewpoint) from a camera with focal length $f$ and focused at a distance $F$: an all-in-focus (AIF) image $\mathbf{x}$ (F-stop: $N=22$) and a blurred image $\mathbf{x}_\text{b}$ (F-stop: $N<22$). Using the AIF $\mathbf{x}$ and an initial learnable noise vector $\mathbf{z}_{T}^{(\mathbf{d})}$, Marigold predicts the relative depth $\mathbf{d}$. We then affine transform $\mathbf{d}$ with learnable parameters ($\alpha,\beta$), obtaining the metric depth $\mathbf{d}^{\text{m}}$. Given the AIF $\mathbf{x}$, depth $\mathbf{d}^{\text{m}}$, and camera parameters ($f, F, N$), we synthesize a blurred image $\hat{\mathbf{x}}_\text{b}$ using the defocus blur forward model. To update the learnable parameters, we compute the gradients of the L2 loss between $\hat{\mathbf{x}}_\text{b}$ and $\mathbf{x}_\text{b}$ with respect to them.
  • Figure 2: Comparing simulated PSFs with the PSF captured from our camera setup. (a) A point source placed $d$ distance away from a thin lens focused at a focus distance $F$ produces a blurred image (PSF) with a diameter $c$, also known as the circle of confusion. The variation of $c$ with source distance $d$ is shown in the plot. (b) The Disc approximation to the camera PSF lies roughly within the same bounds (dotted red circle) as the PSF captured from the RGB camera (Real GT) in (c). The Gaussian PSF significantly exceeds the bounds. Slight differences between the real and Disc PSF stem from the octagonal aperture and diffraction ignored in our model. (c) We rigidly mount an Intel RealSense on a DSLR to capture ground truth depth, and calibrate both cameras to align predicted depth from the RGB image with the ground truth depth for evaluation.
  • Figure 3: Correcting texture-depth coupling. We assess MMDE performance on textured fronto-parallel 2D planes with constant ground truth depths (GT). Using an all-in-focus (a) and blurred image (zoom into insets) (b), our method (RMSE: 0.01) recovers the correct depth maps (c) for the two textured planes. We resolve the texture coupling in the Marigold prediction (Ours Init) and recover the correct metric scale. Competing methods (RMSE: 0.2-0.5) fail to predict both the constant relative depth map (except MLPro and Metric3D in row 1) and the correct scale.
  • Figure 4: Comparisons on our collected dataset. Our method consistently estimates accurate metric depth across all the scenes. We also observe better relative depth recovery due to leveraging defocus cues (zoom in 4x on blurred) in some regions (blue boxes, stairs). While the competing methods perform comparably to ours in some cases (MLPro: plane, kitchen; UniDepth: stairs; Metric3D: thordog, books), they struggle with the rest of the scenes due to incorrect relative depth (toys, books) and metric scale (plane) recovery. recovers sharp details but fails at metric scale and relative depth accuracy for many of the scenes. Since the RealSense has a wider FOV than the DSLR, we show a roughly aligned crop of the GT depth for comparison.
  • Figure 5: Analyzing the effect of different aperture sizes and initializations. Left: We use our forward model to simulate blurred images of a scene from the NYU-v2 dataset. We observe minimum depth error at $N=13$, with errors increasing at more extreme aperture values. Right: We plot $\delta_1$ at the end of optimization for various $\alpha,\beta$ initializations (normalized to 0-1). While the performance degrades for small values of $\alpha, \beta$ (bottom left), it is relatively stable for a broad range of initializations.
  • ...and 3 more figures
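
The affine step in Figure 1, $\mathbf{d}^{\text{m}} = \alpha \mathbf{d} + \beta$, can be illustrated with a deliberately simplified toy. The sketch below is not the paper's procedure: it matches circle-of-confusion diameters directly instead of rendering and comparing blurred images, it uses a brute-force grid search in place of gradient-based optimization, and it keeps the relative depth fixed rather than also updating Marigold's noise latent. The lens values, the sample depths, and the names `coc_diameter` and `fit_scale` are all hypothetical.

```python
def coc_diameter(d, f, F, N):
    """Standard thin-lens blur diameter for an object at distance d (metres)."""
    return (f * f) * abs(d - F) / (N * d * (F - f))


def fit_scale(rel_depth, observed_coc, f, F, N, alphas, betas):
    """Grid-search the affine parameters (alpha, beta) that best explain the blur.

    Stand-in for the gradient-based update: minimises the squared error between
    the blur predicted from alpha * rel_depth + beta and the observed blur.
    """
    best = None
    for a in alphas:
        for b in betas:
            loss = sum(
                (coc_diameter(a * r + b, f, F, N) - c) ** 2
                for r, c in zip(rel_depth, observed_coc)
            )
            if best is None or loss < best[0]:
                best = (loss, a, b)
    return best[1], best[2]


# Hypothetical camera: 50 mm lens, focused at 1 m, F-stop 4.
f, F, N = 0.050, 1.0, 4.0

# Relative depth samples in [0, 1] and the (unknown to the solver) true scale.
rel_depth = [0.0, 0.25, 0.5, 0.75, 1.0]
true_alpha, true_beta = 2.0, 0.5
observed = [coc_diameter(true_alpha * r + true_beta, f, F, N) for r in rel_depth]

alphas = [i / 2 for i in range(1, 9)]    # 0.5, 1.0, ..., 4.0
betas = [j / 10 for j in range(1, 11)]   # 0.1, 0.2, ..., 1.0
alpha, beta = fit_scale(rel_depth, observed, f, F, N, alphas, betas)
print(alpha, beta)
```

The point of the toy is that blur size depends on absolute distance, so a relative depth map combined with observed defocus is enough to pin down the metric scale and offset, which a single sharp image cannot do.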