SDformer: Efficient End-to-End Transformer for Depth Completion

Jian Qian, Miao Sun, Ashley Lee, Jie Li, Shenglong Zhuo, Patrick Yin Chiang

TL;DR

Sparse-to-Dense Transformer (SDformer) is a window-based Transformer architecture for depth completion that applies different window sizes when computing self-attention. It obtains state-of-the-art results against CNN-based depth completion models, with lower computational load and fewer parameters, on the NYU Depth V2 and KITTI DC datasets.

Abstract

Depth completion aims to predict dense depth maps from sparse depth measurements produced by a depth sensor. Currently, Convolutional Neural Network (CNN) based models are the most popular methods for depth completion. However, despite their excellent performance, they suffer from a limited representation area. To overcome this drawback of CNNs, a more effective and powerful method has been presented: the Transformer, a sequence-to-sequence model built on adaptive self-attention. However, the computational cost of the standard Transformer grows quadratically with input resolution because of the key-query dot product, which makes it poorly suited to depth completion tasks. In this work, we propose a different window-based Transformer architecture for depth completion named Sparse-to-Dense Transformer (SDformer). The network consists of an input module that extracts and concatenates depth map and RGB image features, a U-shaped encoder-decoder Transformer for extracting deep features, and a refinement module. Specifically, we first concatenate the depth map features with the RGB image features in the input module. Then, instead of calculating self-attention over the whole feature map, we apply different window sizes to extract long-range depth dependencies. Finally, we refine the predicted features from the input module and the U-shaped encoder-decoder Transformer module to obtain enriched depth features, and employ a convolution layer to produce the dense depth map. In practice, SDformer obtains state-of-the-art results against CNN-based depth completion models, with lower computational load and fewer parameters, on the NYU Depth V2 and KITTI DC datasets.
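
To make the cost argument in the abstract concrete, the following is a minimal NumPy sketch of self-attention restricted to non-overlapping windows. It is an illustration under stated assumptions, not the SDformer block itself: the function names (`window_partition`, `window_reverse`, `window_self_attention`, `attention_pairs`) are hypothetical, the attention is single-head with identity query/key/value projections, and only one window size is used per call. How SDformer combines different window sizes is not specified in this summary.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def window_partition(feat, win):
    """Split an (H, W, C) feature map into (num_windows, win*win, C) token groups."""
    H, W, C = feat.shape
    assert H % win == 0 and W % win == 0, "H and W must be divisible by win"
    x = feat.reshape(H // win, win, W // win, win, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/win, W/win, win, win, C)
    return x.reshape(-1, win * win, C)


def window_reverse(windows, win, H, W):
    """Inverse of window_partition: back to an (H, W, C) feature map."""
    C = windows.shape[-1]
    x = windows.reshape(H // win, W // win, win, win, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/win, win, W/win, win, C)
    return x.reshape(H, W, C)


def window_self_attention(feat, win):
    """Single-head self-attention computed independently inside each window.

    Assumption: Q = K = V = the input tokens (no learned projections),
    which keeps the sketch short while preserving the cost structure.
    """
    H, W, C = feat.shape
    tokens = window_partition(feat, win)                       # (nW, N, C)
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(C)   # (nW, N, N)
    out = softmax(scores, axis=-1) @ tokens                    # (nW, N, C)
    return window_reverse(out, win, H, W)


def attention_pairs(H, W, win=None):
    """Number of key-query dot products: global if win is None, else windowed."""
    n = H * W
    return n * n if win is None else n * win * win


if __name__ == "__main__":
    feat = np.random.default_rng(0).standard_normal((16, 16, 8))
    out = window_self_attention(feat, win=4)
    print(out.shape)                       # (16, 16, 8)
    print(attention_pairs(16, 16))         # 65536
    print(attention_pairs(16, 16, win=4))  # 4096
```

For an H × W feature map, global attention evaluates (HW)² key-query products, while windows of side w need only HW · w². The count is therefore linear in the number of pixels for a fixed window size, which is the motivation for avoiding attention over the whole feature map.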

Paper Structure (10 sections, 6 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: The overall pipeline of SDformer for dense depth prediction.
  • Figure 2: The architecture of the SDformer block.
  • Figure 3: (a) Illustration of the computation of Different Window-based Multi-Scale Self-Attention. (b) Illustration of the computation of Gated Feed-Forward Network.
  • Figure 4: Qualitative comparison results with S2D, CSPN, and Kerlnet on the NYU Depth V2 test set. For better comparison and visualization, we apply the same heat map range to each scene and dilate the sparse depth map. We also highlight some regions for the different methods.
  • Figure 5: Qualitative comparison results with CSPN, TWISE, SSGP, and DDP on the KITTI Depth Completion dataset. We highlight some regions from the different methods for better comparison and visualization.