Zero-shot Depth Completion via Test-time Alignment with Affine-invariant Depth Prior

Lee Hyoseok, Kyeong Seon Kim, Kwon Byung-Ki, Tae-Hyun Oh

TL;DR

The paper tackles domain-robust depth completion from sparse depth by leveraging the affine-invariant depth priors learned by pre-trained monocular depth diffusion models. It introduces a zero-shot method that performs test-time alignment to fuse the diffusion-based depth prior with metric sparse measurements, enforcing hard data consistency through an optimization loop and correcting the diffusion latent with scheduled noise. A prior-based outlier filtering step and a set of losses (sparse-depth consistency, edge-aware smoothness, and Relative Structure Similarity (R-SSIM)) help preserve scene structure and depth affinity across domains. Empirically, the approach generalizes across indoor and outdoor datasets, surpasses several depth-prior-based and unsupervised baselines, and highlights the practical potential of foundation-model priors for robust depth completion without domain-specific training.

Abstract

Depth completion, predicting dense depth maps from sparse depth measurements, is an ill-posed problem requiring prior knowledge. Recent methods adopt learning-based approaches to implicitly capture priors, but the priors primarily fit in-domain data and do not generalize well to out-of-domain scenarios. To address this, we propose a zero-shot depth completion method composed of an affine-invariant depth diffusion model and test-time alignment. We use pre-trained depth diffusion models as depth prior knowledge, which implicitly understand how to fill in depth for scenes. Our approach aligns the affine-invariant depth prior with metric-scale sparse measurements, enforcing them as hard constraints via an optimization loop at test-time. Our zero-shot depth completion method demonstrates generalization across various domain datasets, achieving up to a 21% average performance improvement over the previous state-of-the-art methods while enhancing spatial understanding by sharpening scene details. We demonstrate that aligning a monocular affine-invariant depth prior with sparse metric measurements is a proven strategy to achieve domain-generalizable depth completion without relying on extensive training data. Project page: https://hyoseok1223.github.io/zero-shot-depth-completion/.
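
To make the phrase "affine-invariant" concrete, the sketch below fits a single global scale and shift that map a relative depth map onto sparse metric samples by ordinary least squares. This is a common baseline for relating relative depth to metric depth, not the paper's method: the abstract describes an optimization loop that enforces the measurements as hard constraints, which a global affine fit does not do. The function name `fit_scale_shift` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def fit_scale_shift(prior, sparse, mask):
    """Least-squares scale s and shift t such that s * prior + t ~ sparse on mask.

    Illustrative baseline only; names and setup are hypothetical.
    """
    x = prior[mask].astype(np.float64)
    y = sparse[mask].astype(np.float64)
    # Design matrix with a column of ones so the shift is fitted jointly.
    design = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(s), float(t)

# Synthetic example: the "prior" is the metric depth with scale and shift removed.
rng = np.random.default_rng(0)
metric = rng.uniform(1.0, 10.0, size=(32, 32))   # metric depth (assumed, in meters)
prior = (metric - 1.0) / 9.0                     # affine-invariant version of it

mask = np.zeros_like(metric, dtype=bool)
mask[::8, ::8] = True                            # 16 sparse measurement locations
sparse = np.where(mask, metric, 0.0)

s, t = fit_scale_shift(prior, sparse, mask)
aligned = s * prior + t                          # dense metric-scale estimate
```

In this noise-free toy case the fit recovers the removed scale and shift exactly. With a real prior, the residual at the sparse points is generally nonzero, which is the gap that a per-scene test-time optimization is meant to close.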

Paper Structure

This paper contains 35 sections, 14 equations, 16 figures, 7 tables, 1 algorithm.

Figures (16)

  • Figure 1: Illustration of our approach. At test time, we align the depth affinity from the prior (dashed lines) with the sparse depth measurements as a hard constraint (bold lines). This alignment propagates measurements across the scene to complete unobservable depth values.
  • Figure 2: Test-time alignment process. We incorporate a two-step hard alignment process into the reverse sampling process including an optimization loop and resample at regular intervals. We optimize $\mathbf{z}_0(\mathbf{z}_t)$ and remap it to $\hat{\mathbf{z}}_t$. The latent is then decoded into depth, where the loss is measured against sparse depth. For visibility, the sparse depth points are enlarged.
  • Figure 3: Alignment with metric depth. We evaluate our method's effectiveness against ground truth (GT), accumulated semi-densely. We use only sparse depth (a) to align with actual metric depth values in complex scenes, ensuring a desirable solution. The white lines in (b), (c), and the x-axis of (d) represent pixel indices with valid depth points in a row of the GT.
  • Figure 4: Qualitative comparison on the nuScenes test set. In outdoor scenarios, our test-time alignment method performs robustly even under extreme weather conditions, clearly identifying critical elements such as vehicles and signs.
  • Figure 5: Qualitative comparison on the NYU test set. In indoor scenarios, our test-time alignment method accurately captures scene structures (e.g., chairs) compared to the existing test-time adaptation methods.
  • ...and 11 more figures
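
The two-step process in the Figure 2 caption (optimize the clean latent against the sparse depth, then remap it to a noisy latent) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: a fixed linear map stands in for the latent decoder, plain gradient descent stands in for the optimization loop, and the remapping uses the standard forward-diffusion formula $\hat{\mathbf{z}}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{z}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$. The paper's actual decoder, losses, and noise schedule are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the latent decoder: depth = basis @ z.
n_pixels, n_latent = 64, 4
basis = rng.normal(size=(n_pixels, n_latent))

def decode(z):
    return basis @ z

def sparse_loss(z, sparse, mask):
    """Mean squared error between decoded depth and sparse depth on mask."""
    residual = decode(z)[mask] - sparse[mask]
    return float(np.mean(residual ** 2))

def sparse_loss_grad(z, sparse, mask):
    """Analytic gradient of sparse_loss with respect to z."""
    residual = decode(z)[mask] - sparse[mask]
    return 2.0 * basis[mask].T @ residual / residual.size

def optimize_latent(z0, sparse, mask, steps=500):
    """Gradient descent on the clean latent (toy version of the optimization loop)."""
    # Step size 1/L, where L is the largest eigenvalue of the loss Hessian,
    # so each step cannot increase this quadratic loss.
    lipschitz = 2.0 * np.linalg.norm(basis[mask], 2) ** 2 / int(mask.sum())
    z = z0.copy()
    for _ in range(steps):
        z -= sparse_loss_grad(z, sparse, mask) / lipschitz
    return z

def renoise(z0, alpha_bar_t, rng):
    """Remap a clean latent to noise level t with the forward-diffusion formula."""
    eps = rng.normal(size=z0.shape)
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps

# Synthetic sparse measurements at every fourth pixel of a hidden "true" scene.
z_true = rng.normal(size=n_latent)
mask = np.zeros(n_pixels, dtype=bool)
mask[::4] = True
sparse = np.where(mask, decode(z_true), 0.0)

z_init = rng.normal(size=n_latent)
loss_before = sparse_loss(z_init, sparse, mask)
z_opt = optimize_latent(z_init, sparse, mask)
loss_after = sparse_loss(z_opt, sparse, mask)

# Re-noised latent from which reverse sampling would continue (alpha_bar_t assumed).
z_t_hat = renoise(z_opt, alpha_bar_t=0.5, rng=rng)
```

The point of the sketch is the ordering: the loss is measured in depth space after decoding, the update is applied to the clean latent, and only then is noise added back so that sampling can resume from a latent consistent with the measurements.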