RTS-Mono: A Real-Time Self-Supervised Monocular Depth Estimation Method for Real-World Deployment

Zeyu Cheng, Tongfei Liu, Tao Lei, Xiang Hua, Yi Zhang, Chengkai Tang

TL;DR

RTS-Mono targets real-time, self-supervised monocular depth estimation on resource-limited platforms by pairing a lightweight Lite-Encoder with a multi-scale sparse fusion decoder. Its self-supervised training framework combines a cross-scale depth consistency loss with an edge-aware smoothness loss, preserving depth fidelity while running in real time at up to 49 FPS on Nvidia Jetson Orin. On the KITTI benchmark it achieves state-of-the-art performance among lightweight methods at both low and high resolutions, with improvements in Abs Rel, Sq Rel, and RMSE. Real-world UAV deployments support its practical applicability and its balance between computational efficiency and depth accuracy for perception tasks.

Abstract

Depth information is crucial for autonomous driving and intelligent robot navigation, and the simplicity and flexibility of self-supervised monocular depth estimation make it well suited to these fields. However, most existing monocular depth estimation models consume substantial computing resources. Some methods reduce model size and improve computing efficiency, but their accuracy deteriorates, which seriously hinders the real-world deployment of self-supervised monocular depth estimation models. To address this problem, we propose RTS-Mono, a real-time self-supervised monocular depth estimation method, and deploy it in the real world. RTS-Mono is a lightweight and efficient encoder-decoder architecture. The encoder is based on Lite-Encoder, and the decoder uses a multi-scale sparse fusion framework to minimize redundancy, preserve performance, and improve inference speed. In experiments on the KITTI dataset, RTS-Mono achieves state-of-the-art (SoTA) performance at both high and low resolutions with an extremely low parameter count (3 M). Compared with existing lightweight methods, RTS-Mono improves Abs Rel and Sq Rel by 5.6% and 9.8% at low resolution, and improves Sq Rel and RMSE by 6.1% and 1.9% at high resolution. In real-world deployment experiments, RTS-Mono maintains high accuracy and performs real-time inference on Nvidia Jetson Orin at 49 FPS. Source code is available at https://github.com/ZYCheng777/RTS-Mono.
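
The abstract reports gains in Abs Rel, Sq Rel, and RMSE without defining them here. As a rough illustration, the sketch below assumes the definitions commonly used for KITTI depth evaluation (mean absolute relative error, mean squared relative error, and root mean squared error over pixels with valid ground truth). The function names `depth_errors` and `relative_improvement` are hypothetical and are not taken from the RTS-Mono repository, and the paper's own evaluation protocol (depth caps, cropping, scaling) may differ.

```python
import math


def depth_errors(pred, gt):
    """Return (abs_rel, sq_rel, rmse) for paired depth values.

    Assumes the commonly used KITTI-style definitions; pixels whose
    ground-truth depth is not positive are treated as invalid and skipped.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    n = len(pairs)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    return abs_rel, sq_rel, rmse


def relative_improvement(baseline, ours):
    """Percentage reduction of an error metric relative to a baseline."""
    return (baseline - ours) / baseline * 100.0


if __name__ == "__main__":
    # Toy values for illustration only; these are not results from the paper.
    abs_rel, sq_rel, rmse = depth_errors([2.0, 4.0], [1.0, 4.0])
    print(abs_rel, sq_rel, rmse)
    print(relative_improvement(0.2, 0.1))
```

All three metrics are errors, so lower is better. Under this reading, a figure such as "improved Abs Rel by 5.6%" is a relative reduction of the error against the compared method, not an absolute difference.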

Paper Structure

This paper contains 21 sections, 10 equations, 10 figures, and 6 tables.

Figures (10)

  • Figure 1: Comparison of the results of some lightweight self-supervised monocular depth estimation networks in real-world deployments.
  • Figure 2: Network architecture overview.
  • Figure 3: Multi-scale sparse fusion upsampling module.
  • Figure 4: Fusion block.
  • Figure 5: Decoder structure comparison.
  • ...and 5 more figures