Co-SemDepth: Fast Joint Semantic Segmentation and Depth Estimation on Aerial Images

Yara AlaaEldin; Francesca Odone

TL;DR

This work tackles real-time, onboard scene understanding for UAVs by jointly predicting depth and semantic segmentation from monocular aerial imagery. The authors propose Co-SemDepth, a joint architecture that shares an encoder between a depth pathway based on an M4Depth-like parallax mechanism and a semantic pathway inspired by M4Semantic, producing depth and semantic maps in real time. In experiments on the MidAir and Aeroscapes datasets, Co-SemDepth achieves competitive or superior accuracy compared to single-task and some joint baselines, with significantly lower inference time and memory usage (approximately $49.6$ ms/frame and approximately 6.2 GB on a Quadro P5000). The approach demonstrates the practicality of onboard, real-time, multi-task perception for UAV navigation, with ablations supporting a 5-level design and a modest loss weighting of $w=0.1$ for balancing the two tasks.
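To make the role of the loss weighting concrete, the following is a minimal sketch of how a scalar weight such as $w=0.1$ could combine two task losses into one training objective. The function name `joint_loss` and, in particular, the choice of which term the weight multiplies are assumptions made for illustration; the summary above only states that a weighting of $w=0.1$ balances the tasks, not the exact form of the objective.

```python
def joint_loss(depth_loss: float, semantic_loss: float, w: float = 0.1) -> float:
    """Combine two per-batch task losses into a single scalar objective.

    ASSUMPTION: the weight w scales the semantic term. The source only
    reports w = 0.1; which term it multiplies is not stated there.
    """
    if w < 0:
        raise ValueError("loss weight must be non-negative")
    return depth_loss + w * semantic_loss


# Hypothetical loss values, chosen only to show the arithmetic.
total = joint_loss(depth_loss=2.0, semantic_loss=5.0)
```

With a small weight, the weighted term contributes proportionally less to the gradient, which is one common way to keep two tasks with differently scaled losses from dominating each other.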

Abstract

Understanding the geometric and semantic properties of the scene is crucial in autonomous navigation and particularly challenging in the case of Unmanned Aerial Vehicle (UAV) navigation. Such information may be obtained by estimating depth and semantic segmentation maps of the surrounding environment; for practical use in autonomous navigation, the procedure must be performed as close to real time as possible. In this paper, we leverage monocular cameras on aerial robots to predict depth and semantic maps in low-altitude unstructured environments. We propose a joint deep-learning architecture that can perform the two tasks accurately and rapidly, and validate its effectiveness on the MidAir and Aeroscapes benchmark datasets. Our joint architecture proves to be competitive or superior to other single-task and joint architectures while running fast, at 20.2 FPS on a single NVIDIA Quadro P5000 GPU, with a low memory footprint. All code for training and prediction is available at: https://github.com/Malga-Vision/Co-SemDepth
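The throughput in the abstract (20.2 FPS) and the latency in the TL;DR (about 49.6 ms/frame) describe the same measurement up to rounding. The short sketch below shows the conversion; the helper names are illustrative and not part of the released code.

```python
def fps_to_ms_per_frame(fps: float) -> float:
    """Convert frames per second to milliseconds per frame."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def ms_per_frame_to_fps(ms: float) -> float:
    """Convert milliseconds per frame to frames per second."""
    if ms <= 0:
        raise ValueError("ms must be positive")
    return 1000.0 / ms


latency_ms = fps_to_ms_per_frame(20.2)    # about 49.5 ms per frame
throughput = ms_per_frame_to_fps(49.6)    # about 20.2 frames per second
```

The two reported figures differ only in the last digit because each was rounded independently.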
