TanDepth: Leveraging Global DEMs for Metric Monocular Depth Estimation in UAVs

Horatiu Florea, Sergiu Nedevschi

TL;DR

TanDepth addresses the scale ambiguity of monocular depth estimation for UAVs by projecting TanDEM-X Global Digital Elevation Model (GDEM) points into the image as metric anchor points, enabling metric depth recovery for both scale-and-shift-invariant (SSI) and non-SSI models. The method combines occlusion-aware GDEM projection, an adapted Cloth Simulation Filter (CSF) for ground segmentation, and least-squares scaling in disparity space to produce metric depth at inference time, without additional training. It is evaluated on a variety of real-world outdoor UAV scenes and introduces UAVid-3D-Scenes, a depth-focused extension to UAVid, to support further research. The approach reduces the dependency on large labeled depth datasets and offers a practical path toward metric-depth deployment in aerial perception tasks.
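The least-squares scaling step mentioned above can be illustrated with a minimal sketch. It assumes the common scale-and-shift alignment in disparity space, where a scale `s` and shift `t` are fitted so that `s * d + t` approximates `1 / z` at the anchor pixels. The function names `fit_scale_shift` and `to_metric_depth` are hypothetical, and the paper's exact formulation (for example, how non-SSI models are handled) may differ.

```python
from statistics import fmean


def fit_scale_shift(rel_disparity, ref_depth_m):
    """Closed-form least-squares fit of (scale, shift) such that
    scale * d + shift ~= 1 / z over matched anchor pixels.

    rel_disparity: relative disparities d predicted by the model at the anchors
    ref_depth_m:   reference metric depths z (metres) at the same pixels
    """
    if len(rel_disparity) != len(ref_depth_m) or len(rel_disparity) < 2:
        raise ValueError("need at least two matched anchor points")
    # Work in disparity space: the regression target is inverse metric depth.
    target = [1.0 / z for z in ref_depth_m]
    d_mean, g_mean = fmean(rel_disparity), fmean(target)
    var = sum((d - d_mean) ** 2 for d in rel_disparity)
    if var == 0.0:
        raise ValueError("anchors share one disparity value; fit is degenerate")
    cov = sum((d - d_mean) * (g - g_mean) for d, g in zip(rel_disparity, target))
    scale = cov / var
    shift = g_mean - scale * d_mean
    return scale, shift


def to_metric_depth(rel_disparity, scale, shift, eps=1e-6):
    """Apply the fitted alignment and invert back to depth in metres."""
    # eps guards against division by zero for near-zero aligned disparity.
    return [1.0 / max(scale * d + shift, eps) for d in rel_disparity]
```

In this sketch the fit uses only the anchor pixels, while `to_metric_depth` can then be applied to every pixel of the relative prediction.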

Abstract

Aerial scene understanding systems face stringent payload restrictions and must often rely on monocular depth estimation for modeling scene geometry, which is an inherently ill-posed problem. Moreover, obtaining accurate ground truth data required by learning-based methods raises significant additional challenges in the aerial domain. Self-supervised approaches can bypass this problem, at the cost of providing only up-to-scale results. Similarly, recent supervised solutions which make good progress towards zero-shot generalization also provide only relative depth values. This work presents TanDepth, a practical scale recovery method for obtaining metric depth results from relative estimations at inference-time, irrespective of the type of model generating them. Tailored for Unmanned Aerial Vehicle (UAV) applications, our method leverages sparse measurements from Global Digital Elevation Models (GDEM) by projecting them to the camera view using extrinsic and intrinsic information. An adaptation to the Cloth Simulation Filter is presented, which allows selecting ground points from the estimated depth map to then correlate with the projected reference points. We evaluate and compare our method against alternate scaling methods adapted for UAVs, on a variety of real-world scenes. Considering the limited availability of data for this domain, we construct and release a comprehensive, depth-focused extension to the popular UAVid dataset to further research.
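As a rough illustration of projecting reference points to the camera view using extrinsic and intrinsic information, the sketch below applies a standard pinhole model. It assumes the GDEM points have already been converted to a local Cartesian world frame, that `R` and `t` map world coordinates to camera coordinates, and that lens distortion is negligible. The function name `project_points` is hypothetical, and the occlusion handling and ground masking described for TanDepth are not included.

```python
def project_points(points_world, R, t, fx, fy, cx, cy, width, height):
    """Project 3D points into the image with a pinhole camera model.

    points_world: iterable of (X, Y, Z) in a local Cartesian world frame
    R, t:         world-to-camera rotation (3x3 nested list) and translation
    fx, fy, cx, cy: pinhole intrinsics in pixels
    Returns a list of (u, v, depth_m) for points that land inside the image.
    """
    anchors = []
    for X in points_world:
        # Rigid transform into the camera frame: p_cam = R @ X + t
        xc, yc, zc = (
            sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)
        )
        if zc <= 0:
            continue  # point lies behind the camera
        u = fx * xc / zc + cx
        v = fy * yc / zc + cy
        if 0 <= u < width and 0 <= v < height:
            # zc is the metric depth along the optical axis
            anchors.append((u, v, zc))
    return anchors
```

Each returned triple pairs a pixel location with a metric depth, which is the kind of sparse reference that a scale recovery step can then compare against the relative prediction.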

Paper Structure

This paper contains 16 sections, 3 equations, 7 figures, 5 tables, and 1 algorithm.

Figures (7)

  • Figure 1: TanDepth enables relative depth maps (left window), used by UAVs for estimating scene geometry, to be scaled using sparse points (green) from a satellite-based Global Digital Elevation Model, yielding metric depth results.
  • Figure 2: TanDepth processing flow. RGB, pose, intrinsics and GDEM inputs are processed to generate a sparse metric ground map, which is used to scale a relative depth map estimated by an MDE model. The unscaled depth is used to generate a ground segmentation that masks out any GDEM points projected from other surfaces.
  • Figure 3: TanDEM-X GDEM Projection. Example of GDEM points projected to the image. The left frame shows points from the densified GDEM; the right frame shows raw points from TanDEM-X. In both cases, non-ground points are masked out.
  • Figure 4: CSF Ground Segmentation. Results of the ground segmentation based on the adapted CSF, overlaid as a turquoise layer on input images from the Chilia, Germany, and Oveselu scenes, respectively.
  • Figure 5: UAVid-3D-Scenes. (a) Examples of the dense (left) and sparse (right) depth reconstructions for UAVid scenes recorded in Germany and in China, respectively. (b) Distribution of key metrics in UAVid-3D-Scenes: Coverage (% of reconstructed pixels in a frame) of the sparse depth maps, AGL flight height and the per frame minimum, median and maximum depths across the entire dataset.
  • ...and 2 more figures