Co-SemDepth: Fast Joint Semantic Segmentation and Depth Estimation on Aerial Images

Yara AlaaEldin, Francesca Odone

TL;DR

This work tackles real-time, onboard scene understanding for UAVs by jointly predicting depth and semantic segmentation from monocular aerial imagery. The authors propose Co-SemDepth, a joint architecture that shares an encoder between a depth pathway based on an M4Depth-like parallax mechanism and a semantic pathway inspired by M4Semantic, producing depth and semantic maps in real time. In experiments on the MidAir and Aeroscapes datasets, Co-SemDepth achieves competitive or superior accuracy compared to single-task and some joint baselines, with lower inference time and memory usage (≈49.6 ms/frame and ≈6.2 GB on a Quadro P5000). The approach demonstrates the practicality of onboard, real-time, multi-task perception for UAV navigation, with ablations supporting a 5-level design and a modest loss weighting of $w=0.1$ for balancing the two tasks.
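
As a rough illustration of what a scalar loss weighting such as $w=0.1$ means, the sketch below combines two task losses into one training objective. It is a minimal sketch, not the authors' formulation: the summary does not say which term the weight multiplies, so applying it to the semantic term is an assumption, and the function name `joint_loss` and the loss values are hypothetical.

```python
def joint_loss(depth_loss: float, semantic_loss: float, w: float = 0.1) -> float:
    """Combine two task losses with a scalar weight.

    Assumption: the weight scales the semantic term. The summary only
    states that a weighting of w = 0.1 balances the two tasks.
    """
    return depth_loss + w * semantic_loss


# Illustrative loss values only; they are not taken from the paper.
total = joint_loss(depth_loss=0.8, semantic_loss=1.5)
print(total)
```

With a small weight, the weighted term contributes little to the total, so the other task dominates the gradient signal.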

Abstract

Understanding the geometric and semantic properties of the scene is crucial in autonomous navigation and particularly challenging in the case of Unmanned Aerial Vehicle (UAV) navigation. Such information may be obtained by estimating depth and semantic segmentation maps of the surrounding environment, and for practical use in autonomous navigation the procedure must be performed as close to real time as possible. In this paper, we leverage monocular cameras on aerial robots to predict depth and semantic maps in low-altitude unstructured environments. We propose a joint deep-learning architecture that can perform the two tasks accurately and rapidly, and validate its effectiveness on the MidAir and Aeroscapes benchmark datasets. Our joint architecture proves to be competitive with or superior to the other single-task and joint methods while running fast, at 20.2 FPS on a single NVIDIA Quadro P5000 GPU, with a low memory footprint. All code for training and prediction is available at: https://github.com/Malga-Vision/Co-SemDepth
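
The throughput quoted in the abstract (20.2 FPS) and the per-frame latency quoted in the TL;DR (≈49.6 ms) describe the same measurement in two units. The short sketch below converts between them; the helper names `fps_to_ms` and `ms_to_fps` are illustrative and not part of the released code.

```python
def fps_to_ms(fps: float) -> float:
    """Per-frame latency in milliseconds for a given frame rate."""
    return 1000.0 / fps


def ms_to_fps(ms: float) -> float:
    """Frame rate for a given per-frame latency in milliseconds."""
    return 1000.0 / ms


print(round(fps_to_ms(20.2), 1))  # 49.5
print(round(ms_to_fps(49.6), 1))  # 20.2
```

The two reported figures agree to within rounding.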

Paper Structure

This paper contains 17 sections, 4 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Our proposed Co-SemDepth architecture. It is composed of a shared encoder and two decoders. The encoder and the depth decoder are the same as those presented in M4Depth. The semantic decoder uses the encoded feature maps to estimate the semantic segmentation map. The depth and semantic maps are scaled up as they pass through the successive levels of the decoders. Three levels are shown here, while our experiments use five.
  • Figure 2: Our M4Semantic architecture. It is composed of an encoder and a decoder module with a pyramidal structure. Each level of the decoder is composed of a preprocessing unit and a semantic refiner.
  • Figure 3: An illustration of the modules in our M4Semantic architecture. $N_c$ is the number of semantic classes.
  • Figure 4: Qualitative evaluation of the semantic map predictions of ERFNet (3rd column), FCN MobileNet (4th column), and Co-SemDepth (5th column) on sample images from the MidAir dataset.
  • Figure 5: Visualization of the semantic maps predicted by M4Semantic on the Aeroscapes dataset.
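
The Figure 1 caption describes a shared encoder feeding two decoders whose outputs grow in resolution level by level. The sketch below only traces the map resolutions through such a pyramid; it is not the authors' implementation. It assumes each encoder level halves the height and width, and the 384x384 input size and the function names are hypothetical.

```python
def pyramid_shapes(height: int, width: int, levels: int = 5) -> list:
    """Feature-map resolution at each encoder level, finest first.

    Assumption: every level halves both spatial dimensions.
    """
    shapes = []
    h, w = height, width
    for _ in range(levels):
        h, w = h // 2, w // 2
        shapes.append((h, w))
    return shapes


def decode_order(shapes: list) -> list:
    """Decoders run coarse to fine, so visit the levels in reverse."""
    return list(reversed(shapes))


features = pyramid_shapes(384, 384, levels=5)  # hypothetical input size
for step, (h, w) in enumerate(decode_order(features), start=1):
    # Both decoders read the same shared encoder features at this level.
    print(f"step {step}: depth map {h}x{w}, semantic map {h}x{w}")
```

Under these assumptions, the coarsest estimates are 12x12 and each decoder step doubles the resolution of both maps.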