DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions

Yunxiao Shi, Manish Kumar Singh, Hong Cai, Fatih Porikli

TL;DR

This paper makes two contributions. First, it enhances a baseline convolutional depth completion model by applying attention to 2D features in the bottleneck and skip connections. Second, it proposes normalization techniques for the uplifted 3D point cloud, which improve learning and yield better accuracy than using point transformers off the shelf.

Abstract

In this paper, we introduce a novel approach that harnesses both 2D and 3D attentions to enable highly accurate depth completion without requiring iterative spatial propagations. Specifically, we first enhance a baseline convolutional depth completion model by applying attention to 2D features in the bottleneck and skip connections. This effectively improves the performance of this simple network and sets it on par with the latest, complex transformer-based models. Leveraging the initial depths and features from this network, we uplift the 2D features to form a 3D point cloud and construct a 3D point transformer to process it, allowing the model to explicitly learn and exploit 3D geometric features. In addition, we propose normalization techniques to process the point cloud, which improves learning and leads to better accuracy than directly using point transformers off the shelf. Furthermore, we incorporate global attention on downsampled point cloud features, which enables long-range context while still being computationally feasible. We evaluate our method, DeCoTR, on established depth completion benchmarks, including NYU Depth V2 and KITTI, showcasing that it sets new state-of-the-art performance. We further conduct zero-shot evaluations on ScanNet and DDAD benchmarks and demonstrate that DeCoTR has superior generalizability compared to existing approaches.
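
To make the uplifting step concrete, the following is a minimal sketch of back-projecting per-pixel features into a 3D point cloud and then normalizing it. It is not the authors' implementation. It assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, and it uses a generic center-and-scale normalization as a stand-in, since the abstract does not specify the exact normalization scheme. The function names `uplift_to_point_cloud` and `normalize_point_cloud` are hypothetical.

```python
from math import sqrt


def uplift_to_point_cloud(depth, features, fx, fy, cx, cy):
    """Back-project pixels to 3D with an assumed pinhole camera model.

    depth:    H x W nested list of floats (e.g. an initial completed depth map).
    features: H x W nested list of per-pixel feature vectors.
    Pixels with non-positive depth are skipped.
    """
    points, point_features = [], []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d <= 0:
                continue
            x = (u - cx) * d / fx  # horizontal offset from the principal point
            y = (v - cy) * d / fy  # vertical offset from the principal point
            points.append((x, y, d))
            point_features.append(features[v][u])
    return points, point_features


def normalize_point_cloud(points, eps=1e-8):
    """Center the cloud at its centroid and scale it into the unit sphere.

    This is a common generic scheme, used here only as an assumption.
    """
    n = len(points)
    centroid = tuple(sum(p[i] for p in points) / n for i in range(3))
    centered = [tuple(p[i] - centroid[i] for i in range(3)) for p in points]
    radius = max(sqrt(sum(c * c for c in p)) for p in centered)
    scale = 1.0 / (radius + eps)
    return [tuple(c * scale for c in p) for p in centered]
```

Each 3D point keeps the 2D feature vector of the pixel it came from, so a point-based attention layer can operate on geometry and appearance together.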

Paper Structure (14 sections, 8 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Example depth completion results on the NYU Depth v2 dataset (Silberman et al., 2012). We first upgrade S2D (c) to S2D-TR with efficient attention on 2D features, which significantly improves the initial depth completion accuracy (d). Based on the more accurate initial depths, DeCoTR uplifts 2D features to form a 3D point cloud and leverages cross-attention on 3D points, which leads to highly accurate depth completion, with sharp details and close-to-GT quality (e). We highlight sample regions where we can clearly see progressively improving depths by using our proposed designs.
  • Figure 2: Overview of our proposed DeCoTR. The input RGB image and sparse depth map are first processed by our S2D-TR, which upgrades S2D with efficient 2D attentions. The learned 2D guidance features from S2D-TR are then uplifted to form a 3D feature point cloud based on the initial completed depth map. We normalize the point cloud and feed it through multiple 3D cross-attention layers (3D-TR) to enable geometry-aware feature learning and processing. We also introduce efficient global attention to capture long-range scene context. The attended 3D features from 3D-TR are projected back to 2D and given to a decoder to output the final completed depth map.
  • Figure 3: Qualitative results on NYUD-v2. We compare with SOTA methods such as NLSPN, GraphCSPN, and CompletionFormer. Areas where DeCoTR provides better depth accuracy are highlighted.
  • Figure 4: Qualitative results on KITTI DC. Areas where DeCoTR provides better depth accuracy are highlighted.
  • Figure 5: Qualitative results of zero-shot inference on ScanNet-v2. Areas where DeCoTR provides better depth accuracy are highlighted.
  • ...and 1 more figure
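
The abstract and the Figure 2 caption mention global attention on downsampled point cloud features. The sketch below illustrates why downsampling keeps that affordable; it is not the paper's layer. It assumes strided subsampling and uses the raw features as queries, keys, and values with no learned projections. The function name `global_attention_on_downsampled` and the `stride` parameter are hypothetical.

```python
from math import exp, sqrt


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_attention_on_downsampled(features, stride):
    """Scaled dot-product self-attention over every `stride`-th feature vector.

    Assumption: strided subsampling stands in for whatever downsampling the
    model actually uses, and no learned projections are applied.
    """
    tokens = features[::stride]
    dim = len(tokens[0])
    attended = []
    for query in tokens:
        # Every kept token attends to every other kept token (global context).
        scores = [sum(a * b for a, b in zip(query, key)) / sqrt(dim) for key in tokens]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[i] for w, value in zip(weights, tokens)) for i in range(dim)]
        )
    return attended
```

Full self-attention computes one score per pair of tokens, so keeping every `stride`-th point reduces the number of pairwise scores by roughly a factor of `stride` squared.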