GroCo: Ground Constraint for Metric Self-Supervised Monocular Depth

Aurélien Cecille, Stefan Duffner, Franck Davoine, Thibault Neveu, Rémi Agier

TL;DR

This work tackles the inherent scale ambiguity of self-supervised monocular depth by introducing GroCo, which leverages a ground-plane prior to recover metric depth without annotations. GroCo computes a ground depth $G$ from camera geometry and fuses it with the predicted depth through a learnable ground attention $\alpha$, yielding $D_i = (1-\alpha_i)\hat{D}_i + \alpha_i G_i$, and optimizes with scale-consistency and attention-regularization losses. The approach demonstrates improved scale recovery on KITTI, strong generalization to new camera configurations, and zero-shot transfer to the DDAD dataset, while remaining interpretable via ground-attention maps. Code is released to support practical deployment in autonomous systems.
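The fusion rule above can be illustrated with a minimal NumPy sketch. It is not the released GroCo code. It assumes a flat ground plane, a pinhole camera with intrinsics `K`, a known camera height, and a ground normal expressed in camera coordinates (x right, y down, z forward). The function names `ground_depth` and `fuse_depth`, the `normal` parameter, and the 80 m depth cap are hypothetical choices made for illustration only.

```python
import numpy as np

def ground_depth(K, height, shape, normal=(0.0, 1.0, 0.0), max_depth=80.0):
    """Depth G of a flat ground plane seen by a pinhole camera (illustrative).

    K:       3x3 intrinsics matrix.
    height:  camera height above the ground, in metres.
    normal:  unit vector pointing from the camera towards the ground,
             in camera coordinates (assumed convention: y points down).
    Pixels at or above the horizon get max_depth (assumed cap).
    """
    h, w = shape
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)      # (h, w, 3)
    rays = pix @ np.linalg.inv(K).T                       # rays with z = 1
    denom = rays @ np.asarray(normal, dtype=np.float64)   # n . ray, (h, w)

    G = np.full((h, w), max_depth, dtype=np.float64)
    below = denom > 1e-9                                  # rays that hit the plane
    # A point z * ray lies on the plane when n . (z * ray) = height.
    G[below] = height / denom[below]
    return np.minimum(G, max_depth)

def fuse_depth(pred, G, alpha):
    """Per-pixel blend D = (1 - alpha) * pred + alpha * G."""
    return (1.0 - alpha) * pred + alpha * G

# Toy usage: a 4x5 image, camera 1.5 m above the ground.
K = np.array([[10.0, 0.0, 2.0],
              [0.0, 10.0, 1.0],
              [0.0, 0.0, 1.0]])
G = ground_depth(K, height=1.5, shape=(4, 5))
pred = np.full((4, 5), 10.0)          # stand-in for the network's depth
alpha = np.zeros((4, 5))
alpha[3] = 1.0                        # pretend the bottom row is ground
D = fuse_depth(pred, G, alpha)
```

Where $\alpha_i$ is close to 1 the output follows the metric ground depth, and where it is close to 0 it follows the network prediction. The finite cap on `G` keeps the blend well defined above the horizon, where the ray never meets the plane.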

Abstract

Monocular depth estimation has greatly improved in recent years, but models predicting metric depth still struggle to generalize across diverse camera poses and datasets. While recent supervised methods mitigate this issue by leveraging ground prior information at inference, their adaptability to self-supervised settings is limited by the additional challenge of scale recovery. To address this gap, we propose a novel constraint on ground areas designed specifically for the self-supervised paradigm. This mechanism not only recovers the scale accurately but also ensures coherence between the depth prediction and the ground prior. Experimental results show that our method surpasses existing scale recovery techniques on the KITTI benchmark and significantly enhances model generalization. This improvement is reflected in more robust performance across diverse camera rotations and in zero-shot adaptability to previously unseen driving datasets such as DDAD.

Paper Structure (18 sections, 7 equations, 9 figures, 5 tables)

Figures (9)

  • Figure 1: Example of the models' depth and ground attention prediction. The ground depth is given as input and integrated in the depth prediction using the attention map.
  • Figure 2: Attention maps compared to GeDepth (Yang et al., 2023). While our method outputs a confident and precise ground segmentation, GeDepth tends to have higher recall and uncertainty. Although GeDepth attention maps often treat the bottom part of obstacles as ground, this does not affect final performance because these parts can be compensated by the residual depth or the slope prediction. The comparison also shows that the adaptive (A) version of GeDepth relies much more on the ground prior than the vanilla (V) one, potentially improving robustness.
  • Figure 3: Illustration of the model architecture, highlighting the integration of ground depth information. The input image and ground depth are concatenated to provide ground-aware features. The ground attention mechanism combines the depth map with the ground depth, guided by the attention map, to produce a refined final depth estimation.
  • Figure 4: Overview of the proposed ground constraint loss $\mathcal{L}_{\mathit{const}}$ and attention regularisation $\mathcal{L}_{\mathit{reg}}$. The error image in $\mathcal{L}_{\mathit{const}}$ illustrates how this loss penalizes disagreement between the depth map and the ground depth, indirectly ensuring that the scale of the depth converges to that of the ground.
  • Figure 5: Visualisation of predictions on the same image under various rotation augmentations. The last column shows the relative per-pixel error with respect to the ground truth. The error ranges from -20% (red) to 20% (blue), with white marking 0% or the absence of ground truth.
  • ...and 4 more figures