Gamma-from-Mono: Road-Relative, Metric, Self-Supervised Monocular Geometry for Vehicular Applications

Gasser Elazab, Maximilian Jansen, Michael Unterreiner, Olaf Hellwich

TL;DR

Gamma-from-Mono (GfM) tackles monocular road perception by predicting a global road-plane normal and a per-pixel gamma (the ratio of a point's height above the road plane to its depth), turning a single image into metric, road-relative geometry with minimal calibration. The method is trained in a self-supervised way on neighboring frames: a planar homography aligns the road plane, a residual parallax flow captures non-planar detail, and a probabilistic road mask focuses supervision on near-field regions. Depth and gamma are co-optimized through a combined set of losses, achieving competitive depth accuracy, state-of-the-art near-field gamma accuracy, and strong RSRD performance with only 8.88M parameters. The work presents gamma-space as a robust, interpretable representation of road topology that can support vehicle planning and safety-aware control without large annotated datasets. Limitations include dependence on an accurate camera height for metric scale and difficulty with distant dynamic objects; future work envisions scaling gamma supervision to large driving datasets and integrating gamma-based geometry into planning pipelines.

Abstract

Accurate perception of the vehicle's 3D surroundings, including fine-scale road geometry, such as bumps, slopes, and surface irregularities, is essential for safe and comfortable vehicle control. However, conventional monocular depth estimation often oversmooths these features, losing critical information for motion planning and stability. To address this, we introduce Gamma-from-Mono (GfM), a lightweight monocular geometry estimation method that resolves the projective ambiguity in single-camera reconstruction by decoupling global and local structure. GfM predicts a dominant road surface plane together with residual variations expressed by gamma, a dimensionless measure of vertical deviation from the plane, defined as the ratio of a point's height above it to its depth from the camera, and grounded in established planar parallax geometry. With only the camera's height above ground, this representation deterministically recovers metric depth via a closed form, avoiding full extrinsic calibration and naturally prioritizing near-road detail. Its physically interpretable formulation makes it well suited for self-supervised learning, eliminating the need for large annotated datasets. Evaluated on KITTI and the Road Surface Reconstruction Dataset (RSRD), GfM achieves state-of-the-art near-field accuracy in both depth and gamma estimation while maintaining competitive global depth performance. Our lightweight 8.88M-parameter model adapts robustly across diverse camera setups and, to our knowledge, is the first self-supervised monocular approach evaluated on RSRD.
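The closed-form depth recovery mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a pinhole camera with axes x right, y down, z forward, a unit road normal pointing from the camera toward the road, and depth measured along the optical axis; the function name `depth_from_gamma` and all numeric values are hypothetical. Under these assumptions, a point at depth $d$ on the ray $r = K^{-1}[u, v, 1]^\top$ has height $h = h_c - d\,(N \cdot r)$ above the road, so $\gamma = h/d$ gives $d = h_c / (\gamma + N \cdot r)$.

```python
import numpy as np


def depth_from_gamma(gamma, normal, cam_height, K):
    """Recover per-pixel metric depth from gamma, a road normal, and camera height.

    Assumed convention (not taken from the paper): camera axes are x right,
    y down, z forward; `normal` is a unit vector pointing from the camera
    toward the road, so road points X satisfy normal . X = cam_height.
    """
    gamma = np.asarray(gamma, dtype=np.float64)
    rows, cols = gamma.shape
    u, v = np.meshgrid(np.arange(cols), np.arange(rows))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pixels @ np.linalg.inv(K).T      # back-projected rays with z = 1
    denom = gamma + rays @ normal           # gamma + N . r per pixel
    depth = np.full_like(denom, np.inf)     # rays that never reach the surface
    valid = denom > 1e-6
    depth[valid] = cam_height / denom[valid]
    return depth


# Hypothetical toy setup: 20x8 image, focal length 100 px, camera 1.5 m above the road.
K = np.array([[100.0, 0.0, 4.0],
              [0.0, 100.0, 2.0],
              [0.0, 0.0, 1.0]])
normal = np.array([0.0, 1.0, 0.0])          # flat road directly below the camera
gamma_map = np.full((20, 8), 0.02)          # e.g. a point 0.2 m high at 10 m depth
depth_map = depth_from_gamma(gamma_map, normal, cam_height=1.5, K=K)
height_map = gamma_map * depth_map          # height above the road, in metres
```

Only the camera height enters as a metric quantity here, which is why no full extrinsic calibration is needed in this sketch. Pixels at or above the horizon of the plane yield a non-positive denominator and are left at infinity.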

Paper Structure

This paper contains 41 sections, 21 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: Overview of our self-supervised model. (a) The model predicts $\gamma$ (height/depth ratio) and the road normal $\vec{N}_{\text{pred}}$. Height visualizations are clipped to the range 0 to 0.25 m. (b, c) 3D point clouds from KITTI [geiger2012kitti] and RSRD [zhao2023rsrd], with input images in the top-left.
  • Figure 2: For points $(d,h)$ and $(2d,2h)$, the image gap $\Delta v=f\,h/d$ equals $\Delta v'$: the gap depends only on $h/d$. A known ground reference fixes metric scale, resolving monocular projective ambiguity.
  • Figure 3: Smoothing a small bump causes negligible depth error compared to a tree, but in $\gamma$-space the errors are similar, revealing sensitivity to small height changes. A numerical example is provided in the supplementary material (\ref{subsec:gamma_vs_depth}).
  • Figure 4: Overview of our model architecture. The main network predicts a per-pixel parameter $\gamma$, while PoseNet estimates the relative pose between $I_{s}$ and $I_{t}$. From the relative pose, we compute a planar homography to align the planar road surface between the two images. Additionally, $\gamma$ is used to infer depth, scene height, and a probabilistic road mask, as detailed in \ref{sec:postprocessing}.
  • Figure 5: Qualitative comparison on KITTI and RSRD. Left: (a) KITTI [geiger2012kitti] input and (b) RSRD [zhao2023rsrd] input. Right: (a) KITTI example comparing GfM, GroCo [GroCo2024], and DepthPro [DepthPro2024] on $\gamma$, $\gamma$ error (Abs Diff), depth, and depth error (Abs Rel). (b) RSRD example comparing GfM, Monodepth2 [MONODEPTH2], and DepthPro on $\gamma$ prediction, $\gamma$ error map (Abs Diff), and height predictions clipped to $[-0.5,0.5]$. Colormaps for each metric are shown in the last column.
  • ...and 10 more figures
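
The scale ambiguity described in the Figure 2 caption can be checked numerically with a small sketch. It assumes a pinhole camera over a flat road with the image row increasing downward; the helper names `image_row` and `ground_gap` and the default camera height, focal length, and principal-point row are hypothetical values chosen for illustration.

```python
def image_row(depth, height, cam_height=1.5, focal=1000.0, cy=500.0):
    """Image row (y down) of a point `depth` metres ahead and `height` metres above a flat road."""
    return cy + focal * (cam_height - height) / depth


def ground_gap(depth, height, **camera):
    """Pixel gap between a point and the road point at the same depth (equals f * h / d)."""
    return image_row(depth, 0.0, **camera) - image_row(depth, height, **camera)


# The points (d, h) and (2d, 2h) share the same ratio h/d, hence the same gap.
gap_near = ground_gap(10.0, 0.5)
gap_far = ground_gap(20.0, 1.0)
```

Both gaps evaluate to $f\,h/d$, so the image alone cannot distinguish the two points; the known camera height is what fixes the metric scale.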