Real-time Monocular Depth Estimation on Embedded Systems
Cheng Feng, Congxuan Zhang, Zhen Chen, Weiming Hu, Liyue Ge
TL;DR
This paper tackles the challenge of real-time monocular depth estimation on resource-constrained embedded platforms. It proposes two lightweight encoder-decoder networks, RT-MonoDepth and RT-MonoDepth-S, featuring a simple 4-layer pyramid encoder and a streamlined depth decoder to minimize latency while preserving accuracy. On KITTI benchmarks, RT-MonoDepth achieves competitive depth accuracy with significantly fewer parameters and substantially higher FPS, including up to 18.4 FPS on Jetson Nano and 253.0 FPS on Jetson AGX Orin, while RT-MonoDepth-S reaches up to 364.1 FPS on Orin. The work demonstrates the fastest non-pruned monocular depth estimation on embedded devices, with ablation analyses guiding design choices, and highlights practical applicability for autonomous robots and aerial vehicles.
Abstract
Depth sensing is of paramount importance for unmanned aerial and autonomous vehicles. Nonetheless, contemporary monocular depth estimation methods built on complex convolutional neural networks are too slow for real-time inference on embedded platforms. This paper addresses this challenge by proposing two efficient and lightweight architectures, RT-MonoDepth and RT-MonoDepth-S, which reduce computational complexity and latency. Our methods attain accuracy comparable to prior depth estimation methods while delivering faster inference. Specifically, RT-MonoDepth and RT-MonoDepth-S achieve 18.4 and 30.5 FPS, respectively, on NVIDIA Jetson Nano and 253.0 and 364.1 FPS, respectively, on Jetson AGX Orin, using a single RGB image of resolution 640x192. Experimental results on the KITTI dataset show that our methods achieve higher accuracy and faster inference than existing fast monocular depth estimation methods.
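To put the reported frame rates in terms of a per-frame time budget, the short sketch below converts FPS to average latency. It assumes FPS is simply the reciprocal of mean per-frame inference time (one image at a time, no pipelining), which the abstract does not state explicitly. The helper names `frame_time_ms` and `latency_table` are illustrative and not taken from the paper.

```python
def frame_time_ms(fps: float) -> float:
    """Average per-frame latency in milliseconds, assuming latency = 1 / FPS."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


# Frame rates quoted in the abstract for a 640x192 RGB input.
REPORTED_FPS = {
    ("RT-MonoDepth", "Jetson Nano"): 18.4,
    ("RT-MonoDepth-S", "Jetson Nano"): 30.5,
    ("RT-MonoDepth", "Jetson AGX Orin"): 253.0,
    ("RT-MonoDepth-S", "Jetson AGX Orin"): 364.1,
}


def latency_table(fps_by_config):
    """Map each (model, device) pair to its latency in ms, rounded to 2 decimals."""
    return {key: round(frame_time_ms(fps), 2) for key, fps in fps_by_config.items()}


if __name__ == "__main__":
    for (model, device), ms in latency_table(REPORTED_FPS).items():
        print(f"{model:15s} {device:16s} {ms:7.2f} ms/frame")
```

Under this assumption, RT-MonoDepth takes roughly 54 ms per frame on Jetson Nano and about 4 ms per frame on Jetson AGX Orin.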
