BoRe-Depth: Self-supervised Monocular Depth Estimation with Boundary Refinement for Embedded Systems

Chang Liu, Juan Li, Sheng Zhang, Chang Liu, Jie Li, Xu Zhang

TL;DR

BoRe-Depth tackles boundary blur in self-supervised monocular depth estimation for embedded systems by combining a lightweight MPViT encoder with an Enhanced Feature Adaptive Fusion (EFAF) decoder and a two-stage training regime that incorporates a semantic information loss. The method uses pseudo-depth labels for supervision and a frozen semantic encoder to transfer semantic knowledge, yielding sharper object boundaries while maintaining real-time performance (50.7 FPS) on NVIDIA Jetson Orin with 8.7M parameters. It achieves state-of-the-art boundary quality and depth accuracy on NYUv2 and KITTI, with strong zero-shot generalization on iBims-1, and is supported by comprehensive ablations validating each component. The practical impact lies in deploying accurate, boundary-aware depth maps on resource-constrained platforms for autonomous systems and AR/robotics applications.

Abstract

Depth estimation is one of the key technologies for realizing 3D perception in unmanned systems. Monocular depth estimation has been widely researched because of its low-cost advantage, but the existing methods face the challenges of poor depth estimation performance and blurred object boundaries on embedded systems. In this paper, we propose a novel monocular depth estimation model, BoRe-Depth, which contains only 8.7M parameters. It can accurately estimate depth maps on embedded systems and significantly improves boundary quality. Firstly, we design an Enhanced Feature Adaptive Fusion Module (EFAF) which adaptively fuses depth features to enhance boundary detail representation. Secondly, we integrate semantic knowledge into the encoder to improve the object recognition and boundary perception capabilities. Finally, BoRe-Depth is deployed on NVIDIA Jetson Orin, and runs efficiently at 50.7 FPS. We demonstrate that the proposed model significantly outperforms previous lightweight models on multiple challenging datasets, and we provide detailed ablation studies for the proposed methods. The code is available at https://github.com/liangxiansheng093/BoRe-Depth.

Paper Structure

This paper contains 28 sections, 10 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The proposed BoRe-Depth is a lightweight model with boundary refinement capability. It recovers more accurate boundary details and improves the quality of the dense point cloud.
  • Figure 2: Overview of the BoRe-Depth architecture. During training, the orange part represents the Semantic Segmentation Encoder introduced in the second stage, which calculates the semantic information loss from the differences between features. The blue part represents DepthNet, which directly predicts the depth estimation result and calculates the boundary alignment loss using the pseudo-depth labels. The green part represents PoseNet, which computes the camera pose between two frames. This pose is used to warp the images and calculate the geometric consistency loss and the view reconstruction loss.
  • Figure 3: DepthNet network architecture. (a) The overall architecture of the depth estimation network. The network extracts multi-scale features through an encoder-decoder structure and generates high-quality depth maps. (b) The EFAF module, which aggregates features at each level through lightweight convolution, thereby improving the boundary quality.
  • Figure 4: Qualitative indoor depth estimation results. The four images are drawn from the NYUv2 and iBims-1 datasets. Existing models struggle to delineate object boundaries efficiently, which leads to blurred depth estimates. In contrast, our model robustly predicts the most accurate depth with the clearest boundaries.
  • Figure 5: Qualitative outdoor depth estimation results. The four images are from the KITTI dataset. Our model shows the best estimation accuracy and boundary quality among the compared methods.
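
The Figure 2 caption names four loss terms and states that the semantic encoder is introduced only in the second training stage. The sketch below shows one way such a staged objective could be assembled. It is an illustration under stated assumptions, not the authors' formulation: the loss weights, the mean-squared form of the feature difference, the choice to keep the boundary alignment term active in both stages, and the names `feature_difference`, `total_loss`, and `DEFAULT_WEIGHTS` are all hypothetical.

```python
from typing import Dict, Optional, Sequence

# Hypothetical weights: the caption gives no values, so every term is weighted 1.0.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "view_reconstruction": 1.0,
    "geometric_consistency": 1.0,
    "boundary_alignment": 1.0,
    "semantic_information": 1.0,
}


def feature_difference(depth_feats: Sequence[float],
                       semantic_feats: Sequence[float]) -> float:
    """Mean squared difference between two equally sized feature vectors.

    Assumption: the caption only says the semantic information loss comes
    from "the differences between features"; squared error is a stand-in.
    """
    if len(depth_feats) != len(semantic_feats):
        raise ValueError("feature vectors must have the same length")
    if len(depth_feats) == 0:
        raise ValueError("feature vectors must be non-empty")
    squared = [(d - s) ** 2 for d, s in zip(depth_feats, semantic_feats)]
    return sum(squared) / len(squared)


def total_loss(stage: int,
               terms: Dict[str, float],
               weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted sum of the loss terms that are active in the given stage.

    Assumption: stage 1 uses the three terms that need no semantic encoder;
    stage 2 adds the semantic information term.
    """
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    w = DEFAULT_WEIGHTS if weights is None else weights
    active = ["view_reconstruction", "geometric_consistency", "boundary_alignment"]
    if stage == 2:
        active.append("semantic_information")
    return sum(w[name] * terms[name] for name in active)


# Toy scalar values standing in for per-batch loss terms.
example_terms = {
    "view_reconstruction": 0.5,
    "geometric_consistency": 0.25,
    "boundary_alignment": 0.25,
    "semantic_information": feature_difference([1.0, 2.0], [1.0, 4.0]),
}
stage1 = total_loss(1, example_terms)  # semantic term is ignored
stage2 = total_loss(2, example_terms)  # semantic term is included
```

In a real training loop the scalar inputs would be replaced by tensor-valued losses from a deep learning framework, and the frozen semantic encoder would contribute only to the second-stage term.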