Deep Neighbor Layer Aggregation for Lightweight Self-Supervised Monocular Depth Estimation

Wang Boya, Wang Shuo, Ye Dong, Dou Ziwen

TL;DR

This work tackles the efficiency gap in self-supervised monocular depth estimation by replacing heavy encoders and long-range feature fusion with an EfficientNet-based encoder and a context-focused decoder. The core idea is to perform contextual feature fusion among adjacent scales and apply lightweight channel attention, discarding extensive long-range connections to reduce parameters. Key contributions include a contextual feature fusion mechanism, a multi-scale feature focus guide, and an overall low-parameter yet strong-performing model evaluated on KITTI. The results show competitive accuracy with significantly reduced computational cost, making the approach suitable for real-time robotics and autonomous systems; the authors also provide public code for reproduction.

Abstract

With the growing use of self-supervised monocular depth estimation in robotics and autonomous driving, model efficiency is becoming increasingly important. Most current approaches adopt larger and more complex networks to improve the precision of depth estimation. Some researchers have incorporated Transformers into self-supervised monocular depth estimation to achieve better performance, but this leads to high parameter counts and heavy computation. We present a fully convolutional depth estimation network using contextual feature fusion. In contrast to UNet++ and HRNet, we fuse high-resolution and low-resolution features to preserve information on small targets and fast-moving objects instead of relying on long-range fusion. We further improve the depth estimation results with lightweight, convolution-based channel attention in the decoder stage. Our method reduces the number of parameters without sacrificing accuracy. Experiments on the KITTI benchmark show that our method achieves better results than many large models, such as Monodepth2, with only 30% of the parameters. The source code is available at https://github.com/boyagesmile/DNA-Depth.
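
To make the idea of lightweight, convolution-based channel attention concrete, here is a minimal sketch in plain Python. It assumes an ECA-style design (global average pooling followed by a small 1D convolution across channels and a sigmoid gate); the abstract does not specify the exact module, so the function name `channel_attention` and the `kernel` argument are illustrative assumptions, not the authors' implementation.

```python
import math


def channel_attention(features, kernel):
    """Reweight the channels of `features` (a list of C maps, each H x W).

    `kernel` is a hypothetical odd-length list of 1D convolution weights
    applied across the channel axis; a trained network would learn it.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in features]

    # Excite: 1D convolution over neighbouring channels with zero padding.
    pad = len(kernel) // 2
    padded = [0.0] * pad + pooled + [0.0] * pad
    logits = [
        sum(w * padded[c + j] for j, w in enumerate(kernel))
        for c in range(len(pooled))
    ]

    # Sigmoid gate in (0, 1), then rescale every pixel of each channel.
    gates = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    scaled = [
        [[g * v for v in row] for row in ch]
        for ch, g in zip(features, gates)
    ]
    return scaled, gates


# Two 2x2 channels; the identity-like kernel passes each descriptor through.
feats = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
scaled, gates = channel_attention(feats, [0.0, 1.0, 0.0])
```

In this sketch the attention adds only `len(kernel)` weights regardless of the number of channels, which is one plausible reading of why such a module is described as lightweight.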

Paper Structure (10 sections, 10 equations, 3 figures, 1 table)

This paper contains 10 sections, 10 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Feature fusion. (a) FPN [lin2017feature] uses a top-down pathway to fuse multi-scale features. (b) PANet [liu2018path] adds a bottom-up pathway on top of FPN. (c) BiFPN [tan2019efficientnet] prunes some pathways from PANet. (d) HR-Depth [lyu2021hr] proposes bottom-up feature fusion and adds skip connections. (e) DIFFNet [8962053] fuses the features extracted by the encoder using skip connections. (f) Our contextual feature fusion method, designed to better fuse features of different scales.
  • Figure 2: Overview of the DNA-Depth network. EfficientNet is used as the encoder. The decoder employs contextual feature fusion and channel attention to produce depth maps at multiple scales.
  • Figure 3: Visualization of depth estimation results. The leftmost column contains the input RGB images. The rightmost column shows the results from DNA-Depth-B1; the remaining columns are from other contemporary methods. The results show that our network predicts depth for small and moving objects with performance comparable to DIFFNet while using far fewer parameters.
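
The neighbor-scale fusion referred to in Figure 1(f) and Figure 2 can be illustrated with a small data-flow sketch. It assumes that a feature map is fused only with its immediately finer and coarser neighbors, resized by 2x average pooling and nearest-neighbor upsampling and then concatenated along the channel axis. The helper names (`upsample2x`, `downsample2x`, `fuse_neighbors`) and the choice of resizing operators are assumptions for illustration; the actual decoder presumably applies learned convolutions after fusion, which are omitted here.

```python
def upsample2x(ch):
    """Nearest-neighbour 2x upsampling of one H x W channel."""
    out = []
    for row in ch:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def downsample2x(ch):
    """2x2 average pooling of one channel (H and W assumed even)."""
    return [
        [
            (ch[i][j] + ch[i][j + 1] + ch[i + 1][j] + ch[i + 1][j + 1]) / 4.0
            for j in range(0, len(ch[0]), 2)
        ]
        for i in range(0, len(ch), 2)
    ]


def fuse_neighbors(finer, current, coarser):
    """Concatenate a scale with its resized neighbours along the channel axis.

    Each argument is a list of channels (or None when that neighbour does
    not exist, e.g. at the finest or coarsest scale).
    """
    fused = list(current)
    if finer is not None:
        fused += [downsample2x(ch) for ch in finer]
    if coarser is not None:
        fused += [upsample2x(ch) for ch in coarser]
    return fused


# One channel per scale: 4x4 (finer), 2x2 (current), 1x1 (coarser).
finer = [[[1.0] * 4 for _ in range(4)]]
current = [[[2.0, 2.0], [2.0, 2.0]]]
coarser = [[[3.0]]]
fused = fuse_neighbors(finer, current, coarser)  # three 2x2 channels
```

Because each scale only touches its two neighbors, the number of fused channels stays bounded per level, in contrast to dense long-range schemes where every scale receives features from every other scale.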