Pyramid Feature Attention Network for Monocular Depth Prediction

Yifang Xu, Chenglei Peng, Ming Li, Yang Li, Sidan Du

TL;DR

This work designs a Dual-scale Channel Attention Module (DCAM) that applies channel attention at different scales to aggregate global context and local information from the high-level feature maps, and introduces a scale-invariant gradient loss to increase the penalty on errors in depth-wise discontinuous regions.

Abstract

Deep convolutional neural networks (DCNNs) have achieved great success in monocular depth estimation (MDE). However, few existing works account for how feature maps at different levels contribute to MDE, which leads to inaccurate spatial layout, ambiguous boundaries, and discontinuous object surfaces in the prediction. To tackle these problems, we propose a Pyramid Feature Attention Network (PFANet) that improves both the high-level context features and the low-level spatial features. In PFANet, we design a Dual-scale Channel Attention Module (DCAM) that applies channel attention at different scales to aggregate global context and local information from the high-level feature maps. To exploit the spatial relationships among visual features, we design a Spatial Pyramid Attention Module (SPAM) that guides the network's attention to multi-scale detailed information in the low-level feature maps. Finally, we introduce a scale-invariant gradient loss to increase the penalty on errors in depth-wise discontinuous regions. Experimental results show that our method outperforms state-of-the-art methods on the KITTI dataset.
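
The abstract names a scale-invariant gradient loss but does not give its formula. The sketch below shows one common formulation, assumed here purely for illustration: finite differences of the depth map at several pixel spacings, each normalized by the magnitudes of the two pixels involved, so that multiplying a depth map by a constant leaves the result essentially unchanged. The function names, the spacings (1, 2, 4, 8, 16), and the small `eps` term are assumptions and are not taken from the paper.

```python
import numpy as np


def scale_invariant_gradient(depth, h, eps=1e-8):
    """Normalized finite differences of a 2-D depth map at pixel spacing h.

    Returns an array of shape (2, H - h, W - h) holding the vertical and
    horizontal components. Each difference is divided by the sum of the
    absolute values of the two pixels, which cancels a global scale factor.
    eps (an assumption) only guards against 0/0.
    """
    center = depth[:-h, :-h]
    down = depth[h:, :-h]    # pixel h rows below
    right = depth[:-h, h:]   # pixel h columns to the right
    g_y = (down - center) / (np.abs(down) + np.abs(center) + eps)
    g_x = (right - center) / (np.abs(right) + np.abs(center) + eps)
    return np.stack([g_y, g_x])


def scale_invariant_gradient_loss(pred, target, spacings=(1, 2, 4, 8, 16)):
    """Sum over spacings of the mean L2 distance between gradient fields."""
    total = 0.0
    for h in spacings:
        if h >= min(pred.shape):
            continue  # spacing does not fit inside the image
        diff = scale_invariant_gradient(pred, h) - scale_invariant_gradient(target, h)
        total += np.sqrt((diff ** 2).sum(axis=0)).mean()
    return float(total)


# Synthetic stand-in for a depth map (values are arbitrary, not KITTI data).
rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 80.0, size=(32, 32))
loss_same = scale_invariant_gradient_loss(gt, gt)
loss_scaled = scale_invariant_gradient_loss(3.0 * gt, gt)
loss_flat = scale_invariant_gradient_loss(np.full_like(gt, gt.mean()), gt)
```

Under these assumptions, a prediction that smooths away depth edges (`loss_flat`) is penalized, while a prediction that differs from the ground truth only by a global scale (`loss_scaled`) is not, which is the behavior the abstract attributes to the loss in discontinuous regions.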

Paper Structure (14 sections, 9 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Depth estimation example. (a) Input RGB image; (b) Ground truth depth; (c) DORN (Fu et al., CVPR 2018); (d) Ours.
  • Figure 2: Overview of the Pyramid Feature Attention Network. The network is composed of $E_{i}$ (the $i$-th level of the encoder), Dense ASPP (Yang et al., CVPR 2018), the Dual-scale Channel Attention Module, and the Spatial Pyramid Attention Module. The high-level features come from $E_{3}$, $E_{4}$, and $E_{5}$; the low-level features come from $E_{1}$ and $E_{2}$.
  • Figure 3: The architecture of the Dual-scale Channel Attention Module (DCAM). It consists of two blocks: a local channel attention block and a global channel attention block. The outputs of the two blocks are fused to generate the channel attention map. A recalibration block is used to calibrate the channel attention map and further extract useful information for MDE. GAP denotes a global average pooling layer. GMP denotes a global max pooling layer.
  • Figure 4: The architecture of the Spatial Pyramid Attention Module (SPAM). Ds/4 denotes a /4 downsampling operation. Us×4 denotes a ×4 upsampling operation. Each spatial attention block learns a spatial attention map, and the three resulting maps form a pyramid structure.
  • Figure 5: Qualitative comparison of different methods and our proposed method on the KITTI dataset.
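
The Figure 3 caption mentions global average pooling (GAP) and global max pooling (GMP) feeding a channel attention map. As a rough illustration of that general idea, and not of the authors' DCAM, the sketch below pools a feature map both ways, passes each pooled vector through a shared two-layer bottleneck, and uses the sigmoid of the sum to reweight channels. The bottleneck, the reduction ratio, the random weights, and all names are assumptions; the local/global split, the fusion step, and the recalibration block are not specified in the text above and are left out.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(features, w_reduce, w_expand):
    """Reweight the channels of a (C, H, W) feature map.

    w_reduce has shape (C // r, C) and w_expand has shape (C, C // r);
    both are hypothetical stand-ins for learned weights.
    Returns the reweighted features and the per-channel weights, shape (C,).
    """
    gap = features.mean(axis=(1, 2))  # global average pooling -> (C,)
    gmp = features.max(axis=(1, 2))   # global max pooling -> (C,)

    def bottleneck(v):
        hidden = np.maximum(w_reduce @ v, 0.0)  # ReLU
        return w_expand @ hidden

    # The same bottleneck is applied to both pooled vectors before summing.
    weights = sigmoid(bottleneck(gap) + bottleneck(gmp))
    return features * weights[:, None, None], weights


# Usage with random stand-in data (not real encoder features).
rng = np.random.default_rng(0)
C, r = 8, 2
feats = rng.standard_normal((C, 16, 16))
w_reduce = rng.standard_normal((C // r, C)) * 0.1
w_expand = rng.standard_normal((C, C // r)) * 0.1
out, weights = channel_attention(feats, w_reduce, w_expand)
```

Each channel of `out` is the corresponding channel of `feats` scaled by a weight between 0 and 1, so channels the attention map considers uninformative are suppressed while spatial resolution is unchanged.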