Re-Depth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting

Ananta R. Bhattarai, Helge Rhodin

TL;DR

This work tackles the domain gap in monocular depth estimation by introducing Re-Depth Anything, a test-time refinement framework that re-lights DA-V2 depth predictions and leverages a 2D diffusion prior via Score Distillation Sampling to self-supervise geometry without extra labels. The method targets only the intermediate embeddings and decoder weights while freezing the encoder, and uses depth ensembling across multiple re-lighting runs to stabilize the final output. Across CO3Dv2, KITTI, and ETH3D, it yields consistent quantitative improvements and richer visual details over DA-V2, while qualitative analyses show reduced noise and better structural fidelity. Limitations include occasional oversmoothing and sky artifacts, suggesting avenues for improved shading cues and adaptive regularization in future work.
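The ensembling step mentioned above could be sketched as follows. The summary does not say how the runs are combined, so the per-pixel mean used here is an assumption, as is the hypothetical function name `ensemble_depths`. Depth maps are plain nested lists, and the sketch assumes all runs already share a common scale and shift.

```python
from statistics import fmean


def ensemble_depths(depth_maps):
    """Average several equally sized depth maps pixel by pixel.

    Assumption: a per-pixel mean is used for ensembling; the source only
    states that depths from multiple re-lighting runs are ensembled.
    """
    if not depth_maps:
        raise ValueError("need at least one depth map")
    height, width = len(depth_maps[0]), len(depth_maps[0][0])
    return [
        [fmean(d[y][x] for d in depth_maps) for x in range(width)]
        for y in range(height)
    ]


# Two hypothetical refinement runs over a 2x2 image.
run_a = [[1.0, 2.0], [3.0, 4.0]]
run_b = [[3.0, 4.0], [5.0, 6.0]]
ensembled = ensemble_depths([run_a, run_b])
```

A per-pixel median would be a drop-in alternative if individual runs produce outliers.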

Abstract

Monocular depth estimation remains challenging as recent foundation models, such as Depth Anything V2 (DA-V2), struggle with real-world images that are far from the training distribution. We introduce Re-Depth Anything, a test-time self-supervision framework that bridges this domain gap by fusing DA-V2 with the powerful priors of large-scale 2D diffusion models. Our method performs label-free refinement directly on the input image by re-lighting predicted depth maps and augmenting the input. This re-synthesis method replaces classical photometric reconstruction by leveraging shape from shading (SfS) cues in a new, generative context with Score Distillation Sampling (SDS). To prevent optimization collapse, our framework employs a targeted optimization strategy: rather than optimizing depth directly or fine-tuning the full model, we freeze the encoder and update only the intermediate embeddings and the decoder weights. Across diverse benchmarks, Re-Depth Anything yields substantial gains in depth accuracy and realism over DA-V2, showcasing new avenues for self-supervision by augmenting geometric reasoning.
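To make the re-lighting idea concrete, here is a minimal sketch of shading an input image with normals derived from a depth map. It is not the authors' implementation: the Lambertian model, the `ambient` term, the finite-difference normals, and the function names `normals_from_depth` and `relight` are all assumptions chosen for illustration. The image is grayscale and both inputs are nested lists.

```python
import math


def normals_from_depth(depth):
    """Estimate unit surface normals from a depth map via finite differences."""
    h, w = len(depth), len(depth[0])
    normals = []
    for y in range(h):
        row = []
        for x in range(w):
            # Central differences inside, one-sided differences at the borders.
            x0, x1 = max(x - 1, 0), min(x + 1, w - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, h - 1)
            dzdx = (depth[y][x1] - depth[y][x0]) / max(x1 - x0, 1)
            dzdy = (depth[y1][x] - depth[y0][x]) / max(y1 - y0, 1)
            nx, ny, nz = -dzdx, -dzdy, 1.0
            norm = math.sqrt(nx * nx + ny * ny + nz * nz)
            row.append((nx / norm, ny / norm, nz / norm))
        normals.append(row)
    return normals


def relight(image, depth, light_dir, ambient=0.3):
    """Multiply the image by Lambertian shading under a given light direction.

    Assumption: shading = ambient + (1 - ambient) * max(0, n . l).
    """
    length = math.sqrt(sum(c * c for c in light_dir))
    lx, ly, lz = (c / length for c in light_dir)
    normals = normals_from_depth(depth)
    out = []
    for img_row, n_row in zip(image, normals):
        out_row = []
        for pixel, (nx, ny, nz) in zip(img_row, n_row):
            diffuse = max(0.0, nx * lx + ny * ly + nz * lz)
            out_row.append(pixel * (ambient + (1.0 - ambient) * diffuse))
        out.append(out_row)
    return out


# A plane tilted along x, lit head-on; all pixels receive the same shading.
image = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
tilted = [[0.0, 1.0, 2.0], [0.0, 1.0, 2.0]]
shaded = relight(image, tilted, light_dir=(0.0, 0.0, 1.0))
```

In a test-time refinement loop, `light_dir` would be drawn at random for each step so that the augmented images vary with the estimated geometry; the sampling scheme is left open here because the source does not specify it.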

Paper Structure

This paper contains 34 sections, 13 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Re-Depth Anything refines the prediction of Depth Anything V2 (Yang et al., 2024) by re-lighting the reconstructed geometry and extracting knowledge from diffusion models in a self-supervised manner. In this example, the test-time optimization enhances facial detail (see frontal view) and refines the nose shape to look more like a tiger (side view), correcting the dog-like initial resemblance likely originating from a biased training distribution. The key contribution is a re-synthesis method that replaces photometric reconstruction for self-supervision.
  • Figure 2: Re-Depth Anything overview. Our main contribution is the re-lighting module that randomizes light conditions and shades the estimated geometry on the input. Notably, the re-lighting does not need to look physically accurate because we are only augmenting the image, not photometrically reconstructing it. Also key is the SDS optimization of the embeddings and decoder while the encoder stays frozen.
  • Figure 3: Qualitative Comparison, highlighting the added detail (rows 1,2,3,6) and noise-removal effects (rows 4,5).
  • Figure 4: Qualitative ablation showing that both optimizing depth directly and fine-tuning the whole network at once are detrimental. The listed error values relate to visual and quantitative improvements.
  • Figure 5: Limitations.
  • ...and 7 more figures