GroCo: Ground Constraint for Metric Self-Supervised Monocular Depth
Aurélien Cecille, Stefan Duffner, Franck Davoine, Thibault Neveu, Rémi Agier
TL;DR
This work tackles the inherent scale ambiguity of self-supervised monocular depth by introducing GroCo, which leverages a ground-plane prior to recover metric depth without annotations. GroCo computes a ground depth $G$ from camera geometry and fuses it with the predicted depth through a learnable ground attention $\alpha$, yielding $D_i = (1-\alpha_i)\hat{D}_i + \alpha_i G_i$, and optimizes with scale-consistency and attention-regularization losses. The approach demonstrates improved scale recovery on KITTI, strong generalization to new camera configurations, and zero-shot transfer to the DDAD dataset, while remaining interpretable via ground-attention maps. Code is released to support practical deployment in autonomous systems.
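The fusion rule above can be illustrated with a minimal NumPy sketch. It reads $i$ as a pixel index and assumes a pinhole camera with zero pitch and roll above a perfectly flat road, a camera y-axis pointing down, and a known camera height. These assumptions, the function names `ground_depth` and `fuse_depth`, the `max_depth` clamp, and the sample intrinsics are illustrative and are not taken from the paper. In GroCo the attention $\alpha$ is learned; here it is simply an input array.

```python
import numpy as np


def ground_depth(K, height, width, cam_height, max_depth=80.0):
    """Depth G of a flat ground plane for every pixel (hypothetical helper).

    Assumes zero pitch/roll and a camera y-axis pointing down, so the ground
    is the plane y = cam_height in camera coordinates.
    """
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Back-project pixels to rays with unit z-component: r = K^{-1} [u, v, 1]^T.
    rays = pixels @ np.linalg.inv(K).T
    # Component of each ray along the downward direction.
    down = rays @ np.array([0.0, 1.0, 0.0])
    # Pixels at or above the horizon never hit the ground: keep max_depth there.
    G = np.full((height, width), max_depth)
    hits = down > 1e-6
    G[hits] = np.minimum(cam_height / down[hits], max_depth)
    return G


def fuse_depth(pred_depth, ground, alpha):
    """Per-pixel blend D = (1 - alpha) * D_hat + alpha * G."""
    return (1.0 - alpha) * pred_depth + alpha * ground


# Toy usage with made-up intrinsics and a hand-written attention map.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
G = ground_depth(K, height=48, width=64, cam_height=1.5)
pred = np.full((48, 64), 20.0)   # stand-in for the network prediction D_hat
alpha = np.zeros((48, 64))
alpha[30:, :] = 0.8              # trust the ground prior in the lower rows
D = fuse_depth(pred, G, alpha)
```

With $\alpha_i = 0$ the output equals the network prediction, and with $\alpha_i = 1$ it equals the ground depth, which is why the attention map can be read directly as the degree of reliance on the ground prior at each pixel.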
Abstract
Monocular depth estimation has greatly improved in recent years, but models predicting metric depth still struggle to generalize across diverse camera poses and datasets. While recent supervised methods mitigate this issue by leveraging ground prior information at inference, their adaptability to self-supervised settings is limited due to the additional challenge of scale recovery. Addressing this gap, we propose a novel constraint on ground areas designed specifically for the self-supervised paradigm. This mechanism not only makes it possible to accurately recover the scale but also ensures coherence between the depth prediction and the ground prior. Experimental results show that our method surpasses existing scale recovery techniques on the KITTI benchmark and significantly enhances model generalization capabilities. This improvement is reflected in more robust performance across diverse camera rotations and in zero-shot adaptability to previously unseen driving datasets such as DDAD.
