Real-time Monocular Depth Estimation on Embedded Systems

Cheng Feng, Congxuan Zhang, Zhen Chen, Weiming Hu, Liyue Ge

TL;DR

This paper tackles the challenge of real-time monocular depth estimation on resource-constrained embedded platforms. It proposes two lightweight encoder-decoder networks, RT-MonoDepth and RT-MonoDepth-S, featuring a simple 4-layer pyramid encoder and a streamlined depth decoder to minimize latency while preserving accuracy. On KITTI benchmarks, RT-MonoDepth achieves competitive depth accuracy with significantly fewer parameters and substantially higher FPS, including up to 18.4 FPS on Jetson Nano and 253.0 FPS on Jetson AGX Orin, while RT-MonoDepth-S reaches up to 364.1 FPS on Orin. The work demonstrates the fastest non-pruned monocular depth estimation on embedded devices, with ablation analyses guiding design choices, and highlights practical applicability for autonomous robots and aerial vehicles.

Abstract

Depth sensing is of paramount importance for unmanned aerial and autonomous vehicles. Nonetheless, contemporary monocular depth estimation methods that rely on complex convolutional neural networks are too slow for real-time inference on embedded platforms. This paper addresses this challenge by proposing two efficient, lightweight architectures, RT-MonoDepth and RT-MonoDepth-S, that reduce computational complexity and latency. Our methods attain accuracy comparable to prior depth estimation methods while running faster. Specifically, RT-MonoDepth and RT-MonoDepth-S achieve 18.4 and 30.5 FPS, respectively, on NVIDIA Jetson Nano and 253.0 and 364.1 FPS on Jetson AGX Orin, using a single RGB image of resolution 640×192. The experimental results on the KITTI dataset show that our methods are more accurate and faster than existing fast monocular depth estimation methods.
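To put the reported frame rates in terms of a per-frame time budget, the short sketch below converts FPS to milliseconds per frame. It assumes that each reported FPS figure is simply the reciprocal of the mean single-image latency (no batching or pipelining); the function name `latency_ms` and the table layout are illustrative and not taken from the paper.

```python
# Frame rates (FPS) reported in the abstract for a 640x192 RGB input.
REPORTED_FPS = {
    ("Jetson Nano", "RT-MonoDepth"): 18.4,
    ("Jetson Nano", "RT-MonoDepth-S"): 30.5,
    ("Jetson AGX Orin", "RT-MonoDepth"): 253.0,
    ("Jetson AGX Orin", "RT-MonoDepth-S"): 364.1,
}


def latency_ms(fps: float) -> float:
    """Per-frame latency in milliseconds, assuming latency = 1 / FPS."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


if __name__ == "__main__":
    for (device, model), fps in REPORTED_FPS.items():
        print(f"{device:16s} {model:15s} {fps:6.1f} FPS  ~{latency_ms(fps):6.2f} ms/frame")
```

Under this assumption, 18.4 FPS on Jetson Nano corresponds to roughly 54 ms per frame, and 253.0 FPS on Jetson AGX Orin to roughly 4 ms per frame.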

Paper Structure

This paper contains 13 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Accuracy (in $\delta_1$) vs. runtime (in FPS) on NVIDIA Jetson Nano, Xavier NX, and AGX Orin for various depth estimation algorithms on the KITTI dataset [DBLP:journals/ijrr/GeigerLSU13] using the Eigen split [DBLP:conf/nips/EigenPF14]. The top right represents the desired characteristics of a depth estimation network design: high inference speed and high accuracy. The data come from published papers or were measured using the official implementations (online test, batch size 1).
  • Figure 2: Proposed RT-MonoDepth framework. The input image has shape $H \times W \times 3$; 'F' denotes feature maps in the pyramid encoder and 'D' denotes the depth map predicted at each scale, with the subscript indicating the scale and hence the shape (e.g., $F_n\colon\frac{H}{2^n} \times \frac{W}{2^n} \times C_n$, $D_n\colon\frac{H}{2^n} \times \frac{W}{2^n} \times 1$). Decoder$_3$, Decoder$_2$, and Decoder$_1$ can be removed at inference time.
  • Figure 3: Feature fusion methods of different approaches. (A) Monodepth2 [DBLP:conf/iccv/GodardAFB19]; (B) FastDepth [DBLP:conf/icra/WofkMYKS19]; (C) GuideDepth [DBLP:conf/icra/0006DGNB22]; (D) Our model.
  • Figure 4: Qualitative results of monocular depth estimation comparing RT-MonoDepth with Monodepth2 and GuideDepth on the KITTI dataset. Our method provides cleaner boundaries and more effective object reconstruction than the other methods.
  • Figure 5: Inference speed comparison between RT-MonoDepth and GuideDepth. Our method runs faster than GuideDepth.
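
The shape convention in the Figure 2 caption ($F_n$ and $D_n$ at $\frac{H}{2^n} \times \frac{W}{2^n}$) can be made concrete for the 640×192 input used in the paper. The sketch below is only an illustration of that formula: the scale indices 1 to 4 are an assumption based on the "4-layer pyramid encoder" description, and the channel widths in `ASSUMED_CHANNELS` are placeholder values, not the paper's actual $C_n$.

```python
def feature_shape(height: int, width: int, n: int, channels: int) -> tuple:
    """Shape of F_n: (H / 2^n, W / 2^n, C_n), following the Figure 2 caption."""
    scale = 2 ** n
    if height % scale or width % scale:
        raise ValueError(f"{height}x{width} is not divisible by 2^{n}")
    return (height // scale, width // scale, channels)


def depth_shape(height: int, width: int, n: int) -> tuple:
    """Shape of D_n: (H / 2^n, W / 2^n, 1), a single-channel depth map."""
    return feature_shape(height, width, n, 1)


# Placeholder channel widths C_1..C_4 (hypothetical, for illustration only).
ASSUMED_CHANNELS = {1: 32, 2: 64, 3: 128, 4: 256}

if __name__ == "__main__":
    H, W = 192, 640  # input resolution 640x192 (width x height)
    for n, c_n in ASSUMED_CHANNELS.items():
        print(f"F_{n}: {feature_shape(H, W, n, c_n)}   D_{n}: {depth_shape(H, W, n)}")
```

With $H = 192$ and $W = 640$, the spatial sizes at scales 1 through 4 are 96×320, 48×160, 24×80, and 12×40, whatever the channel widths are.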