Revisiting Gradient-based Uncertainty for Monocular Depth Estimation

Julia Hornauer, Amir El-Ghoussani, Vasileios Belagiannis

TL;DR

This paper addresses per-pixel uncertainty in monocular depth estimation with a post hoc, training-free, gradient-based approach. A reference depth $\mathbf{d}_{ref}$ is obtained by applying an invertible transformation $T(\cdot)$ (for example, a horizontal flip) to the input image, predicting depth for the transformed image, and mapping that prediction back with $T^{-1}(\cdot)$. An auxiliary loss $\mathcal{L}_{aux}(\mathbf{d}, \mathbf{d}_{ref})$ is then back-propagated through the fixed depth estimator to obtain gradient maps $\mathbf{g}_{i}$ with respect to decoder features. These gradients are converted into pixel-wise uncertainty maps, taken either from a single decoder layer or combined across multiple layers with a normalised scoring scheme. Evaluated on KITTI and NYU with convolutional and transformer-based models, the method requires no retraining and outperforms related approaches in particular for models trained on monocular sequences. Ablations examine the design choices across architectures and augmentations, and code is publicly available.

Abstract

Monocular depth estimation, similar to other image-based tasks, is prone to erroneous predictions due to ambiguities in the image, for example, caused by dynamic objects or shadows. For this reason, pixel-wise uncertainty assessment is required for safety-critical applications to highlight the areas where the prediction is unreliable. We address this in a post hoc manner and introduce gradient-based uncertainty estimation for already trained depth estimation models. To extract gradients without depending on the ground truth depth, we introduce an auxiliary loss function based on the consistency of the predicted depth and a reference depth. The reference depth, which acts as pseudo ground truth, is in fact generated using a simple image or feature augmentation, making our approach simple and effective. To obtain the final uncertainty score, the derivatives w.r.t. the feature maps from single or multiple layers are calculated using back-propagation. We demonstrate that our gradient-based approach is effective in determining the uncertainty without re-training using the two standard depth estimation benchmarks KITTI and NYU. In particular, for models trained with monocular sequences and therefore most prone to uncertainty, our method outperforms related approaches. In addition, we publicly provide our code and models: https://github.com/jhornauer/GrUMoDepth
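The procedure described in the abstract can be illustrated with a small NumPy sketch. It is a toy under stated assumptions, not the authors' implementation: the "model" is a hypothetical one-layer stand-in (horizontal 1x3 filters followed by a ReLU and a weighted sum), the auxiliary loss is assumed to be the squared difference between the prediction and the reference depth, and the uncertainty map is assumed to be the channel-wise maximum of the absolute gradient, min-max normalised. The gradient is written out by hand here; with a real network an automatic-differentiation framework would compute it. All function names are illustrative.

```python
import numpy as np

def toy_decoder_features(x, kernels):
    """Hypothetical stand-in for decoder features: horizontal 1x3 filters."""
    # x: (H, W) image, kernels: (C, 3) filter taps.
    padded = np.pad(x, ((0, 0), (1, 1)), mode="edge")
    # Left, centre and right neighbour of every pixel, stacked as (3, H, W).
    neigh = np.stack([padded[:, :-2], padded[:, 1:-1], padded[:, 2:]])
    return np.tensordot(kernels, neigh, axes=1)  # (C, H, W)

def toy_depth_head(feats, weights):
    """Hypothetical depth head: d = sum_c w_c * relu(f_c)."""
    return np.tensordot(weights, np.maximum(feats, 0.0), axes=1)  # (H, W)

def gradient_uncertainty(x, kernels, weights):
    """Pixel-wise uncertainty from gradients of an auxiliary loss (sketch)."""
    feats = toy_decoder_features(x, kernels)
    d = toy_depth_head(feats, weights)

    # Reference depth: predict on the horizontally flipped image, then
    # flip the prediction back (T is its own inverse for a flip).
    flipped_feats = toy_decoder_features(x[:, ::-1], kernels)
    d_ref = toy_depth_head(flipped_feats, weights)[:, ::-1]

    # Assumed auxiliary loss: L_aux = sum((d - d_ref)^2), d_ref held constant.
    residual = d - d_ref
    # Manual back-propagation to the feature maps:
    # dL/df_c = 2 * (d - d_ref) * w_c * 1[f_c > 0]
    grads = 2.0 * residual[None] * weights[:, None, None] * (feats > 0)

    # Assumed scoring: channel-wise max of |gradient|, min-max normalised.
    u = np.abs(grads).max(axis=0)
    span = u.max() - u.min()
    return (u - u.min()) / span if span > 0 else np.zeros_like(u)

# Usage with random data; no ground-truth depth is needed.
rng = np.random.default_rng(0)
image = rng.random((8, 16))
uncertainty = gradient_uncertainty(image, rng.normal(size=(4, 3)), rng.normal(size=4))
print(uncertainty.shape)
```

The toy filters are deliberately not mirror-symmetric, so the flipped prediction disagrees with the original one and the loss is non-zero; a model that is perfectly equivariant to the flip would yield no gradient signal under this scheme.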

Paper Structure

This paper contains 34 sections, 14 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Example image from KITTI [Geiger2013IJRR] with corresponding depth prediction and uncertainty estimated with our gradient-based method. The uncertainty estimate demonstrates that the predicted depth is not reliable in the region where shadows and occlusions appear.
  • Figure 2: Overview of our gradient-based uncertainty estimation method with the horizontal flip as transformation $T(\cdot)$. First, we apply the transformation $T(\cdot)$ to the image $\mathbf{x}$ to obtain $\mathbf{x}^\prime$. Then, both images are passed through the depth estimation model to obtain the depth estimates $\mathbf{d}$ and $\mathbf{d}^\prime$, respectively. Since $T(\cdot)$ is an invertible geometric transformation, we apply the inverse transformation $T^{-1}(\cdot)$ to the depth estimate $\mathbf{d}^\prime$ to obtain the reference depth $\mathbf{d}_{ref}$. For the gradient extraction, the auxiliary loss $\mathcal{L}_{aux}(\mathbf{d}, \mathbf{d}_{ref})$ is back-propagated through the decoder to extract the gradient maps $\mathbf{g}_{i}$ at different decoder layers $i$. Either one specific layer or multiple layers can be chosen for the gradient extraction. The extracted gradient maps are then used to calculate the respective uncertainty maps $\mathbf{u}_{i}$. Finally, the pixel-wise uncertainty score $\mathbf{u}$ is either taken directly from a single uncertainty map $\mathbf{u}_{i}$ or calculated from $k$ uncertainty maps $\{\mathbf{u}_{i}\}^{k}$.
  • Figure 3: The sparsification error in terms of Absolute Relative Error (Abs Rel) over the fraction of remaining pixels is shown for Monodepth2 [monodepth2] (MD) and MonoViT [monovit] (MV) trained with monocular supervision or stereo pair supervision on KITTI [Geiger2013IJRR] as well as Monodepth2 [monodepth2] trained in a supervised manner on NYU. Models and methods are denoted as [model]:[method]. We compare our gradient-based uncertainty estimation approach to Post and Log applied to the regular depth estimation model (Reg-model) and the predictive depth estimation model (Log-model), respectively.
  • Figure 4: Uncertainty estimation example from Monodepth2 [monodepth2] trained on NYU Depth V2 [SilbermanECCV12]. In (a), the input image is shown. (b) and (c) visualise the depth prediction and the error, respectively. (d) to (h) display the uncertainty estimates of the different methods.
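The sparsification error plotted in Figure 3 can be computed roughly as follows. This is a minimal NumPy sketch of the usual sparsification protocol, assumed rather than taken from the paper's code: pixels are removed in order of decreasing uncertainty, Abs Rel is averaged over the remaining pixels, and the same curve obtained by ranking pixels by their true error (the oracle) is subtracted. The function names and the number of steps are illustrative choices.

```python
import numpy as np

def sparsification_curve(per_pixel_error, ranking, fractions):
    """Mean error of the pixels left after removing the top-ranked fraction."""
    order = np.argsort(-ranking)            # highest-ranked pixels first
    sorted_err = per_pixel_error[order]
    n = sorted_err.size
    # int() floors, so at least one pixel remains for every fraction < 1.
    return np.array([sorted_err[int(f * n):].mean() for f in fractions])

def sparsification_error(pred, gt, uncertainty, steps=20):
    """Gap between uncertainty-ranked and oracle-ranked sparsification."""
    # Per-pixel Absolute Relative Error; gt is assumed strictly positive.
    err = (np.abs(pred - gt) / gt).ravel()
    removed = np.linspace(0.0, 1.0, steps, endpoint=False)
    by_uncertainty = sparsification_curve(err, uncertainty.ravel(), removed)
    oracle = sparsification_curve(err, err, removed)
    # The fraction of remaining pixels is 1 - removed.
    return removed, by_uncertainty - oracle

# Usage with synthetic data: an uncertainty map correlated with the error.
rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 10.0, size=(8, 16))
pred = gt + rng.normal(scale=0.5, size=gt.shape)
unc = np.abs(pred - gt) + rng.normal(scale=0.1, size=gt.shape)
removed, gap = sparsification_error(pred, gt, unc)
print(gap.mean())
```

A gap close to zero means the uncertainty ranks pixels almost as well as the true error would; the mean of the gap over all fractions gives a single summary number for comparing methods.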