Dusk Till Dawn: Self-supervised Nighttime Stereo Depth Estimation using Visual Foundation Models

Madhu Vankadari, Samuel Hodgson, Sangyun Shin, Kaichen Zhou, Andrew Markham, Niki Trigoni

TL;DR

This work tackles nighttime self-supervised stereo depth estimation by leveraging visual foundation models to obtain robust, illumination-invariant features. A feature-level masking strategy and a distance regularizer improve depth accuracy in low-texture, poorly lit regions, while a cross-image transformer-based stereo matcher and RAFT-style upsampling deliver refined disparities. The authors introduce a comprehensive set of new evaluation metrics based on depth bins to better reflect nonuniform ground-truth depth, and demonstrate strong generalization on Oxford RobotCar and the MS2 nighttime sequences. Overall, the method achieves competitive performance against supervised baselines and shows robust depth estimation in challenging night scenes with minimal ground-truth supervision.

Abstract

Self-supervised depth estimation algorithms rely heavily on frame-warping relationships and degrade substantially in challenging circumstances, such as low-visibility and nighttime scenarios with varying illumination conditions. To address this challenge, we introduce an algorithm designed to achieve accurate self-supervised stereo depth estimation in nighttime conditions. Specifically, we use pretrained visual foundation models to extract generalised features across challenging scenes and present an efficient method for matching and integrating these features from stereo frames. Moreover, to prevent pixels that violate the photometric consistency assumption from negatively affecting the depth predictions, we propose a novel masking approach designed to filter out such pixels. Lastly, to address weaknesses in the evaluation of current depth estimation algorithms, we present novel evaluation metrics. Our experiments, conducted on challenging datasets including Oxford RobotCar and Multi-Spectral Stereo, demonstrate the robust improvements realized by our approach. Code is available at: https://github.com/madhubabuv/dtd

Paper Structure (20 sections, 7 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: A comparison of the disparity estimated by our method (Ours) with that of a SOTA stereo-matching method, IGEV-Stereo [xu2023iterative]. Note how IGEV-Stereo incorrectly estimates the sky (blue rectangle) as being very near, and how it loses the detail at the edge of the wall and the lamp-post (green rectangle). In comparison, our method estimates these depths accurately.
  • Figure 2: Our approach consists of four main elements. Features are encoded independently for each input using DINO [caron2021emerging_dino]; a learnable projection head then adapts these features and reduces their dimension, giving $f_{l}$ and $f_{r}$. Stereo matching of the features then takes place, with disparity filtering yielding the mask $\mathcal{M}$ and the combination of $f_{l}$ and $f_{r}$ providing the correspondence volume $\mathcal{C}_{disp}$. Applying softmax to $\mathcal{C}_{disp}$ gives $\mathcal{W}_{disp}$, which is used to find the coarse disparity $d_{c}$. The coarse disparity and the mask combine to give the global disparity $d_{g}$, which is refined and upsampled to give the final depth.
  • Figure 3: Qualitative comparison of the proposed method with SGM [hernandez2016embedded] and the state-of-the-art supervised methods Unimatch-Stereo [xu2023unifying], IGEV-Stereo [xu2023iterative], and Sharma et al. [sharma2020nighttimestereo]. The brighter the pixel, the closer it is to the camera.
  • Figure 4: Visualization of (a) the ground-truth depth distribution of the RobotCar test split, and (b) the square relative error calculated at different depth bins using the proposed weighted metric.
  • Figure 5: Visualization of the estimated masks in (c), with their input images (left camera) in (a) and the estimated disparity maps in (b).
  • ...and 1 more figures
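
The Figure 2 caption describes applying softmax to the correspondence volume $\mathcal{C}_{disp}$ to obtain weights $\mathcal{W}_{disp}$, from which the coarse disparity $d_{c}$ is found. The sketch below illustrates that general idea (often called soft-argmax) on a single image row in plain Python. It is not the authors' implementation: the dot-product similarity, the `temperature` parameter, the `max_disp` search range, and the function name `coarse_disparity_row` are all assumptions made for illustration, and the paper's transformer-based matcher, mask, and refinement steps are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def coarse_disparity_row(f_left, f_right, max_disp, temperature=1.0):
    """Soft-argmax disparity for one image row (hypothetical helper).

    f_left, f_right: lists of per-pixel feature vectors, one per column.
    A left pixel at column x with disparity d is matched against the
    right pixel at column x - d (rectified stereo assumption).
    """
    disparities = []
    for x, feat in enumerate(f_left):
        # Only disparities that keep x - d inside the image are valid.
        candidates = range(0, min(max_disp, x) + 1)
        # One slice of the correspondence volume: a similarity score per
        # candidate disparity (dot product is an assumption here).
        c_disp = [dot(feat, f_right[x - d]) / temperature for d in candidates]
        # Softmax turns the scores into matching weights that sum to one.
        w_disp = softmax(c_disp)
        # The coarse disparity is the weighted mean of the candidates.
        disparities.append(sum(w * d for w, d in zip(w_disp, candidates)))
    return disparities
```

Because the result is a weighted mean rather than a hard argmax, it is differentiable with respect to the features and can take sub-pixel values. When a pixel's features match no candidate better than any other, the weights become uniform and the estimate falls back to the middle of the search range, which is one reason a mask over unreliable pixels is useful.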