Gated Cross-Attention Network for Depth Completion

Xiaogang Jia; Songlei Jian; Yusong Tan; Yonggang Che; Wei Chen; Zhengfa Liang

TL;DR

This work tackles depth completion under the challenge of asymmetric color-depth information by introducing a Gated Cross-Attention Network that performs local fusion via a gating mechanism and global fusion via a low-resolution Transformer. It enables mutual supervision between RGB and depth features through co-attention and confidence propagation, reducing reliance on additional confidence branches. Hyperparameter optimization with Ray Tune (AsyncHyperBandScheduler + HyperOptSearch) automates the search for optimal per-scale iteration counts, yielding Pareto-optimal speed-accuracy trade-offs and state-of-the-art KITTI results. The approach demonstrates strong generalization across indoor and outdoor datasets and offers a practical pathway to real-time, high-precision depth completion.
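
The search over per-scale iteration counts can be illustrated with a small, self-contained sketch. This is not Ray Tune and not the authors' search code: it uses plain synchronous successive halving with random sampling, a simplified relative of the asynchronous scheduler (AsyncHyperBandScheduler) and the HyperOptSearch algorithm named above. The objective `toy_objective`, the three scales, and the range of 1 to 4 iterations per scale are all hypothetical placeholders chosen only to make the example runnable.

```python
import random


def toy_objective(config, budget):
    """Hypothetical stand-in for a validation error after `budget` epochs.

    Assumption (not from the paper): more fusion iterations lower the error
    with diminishing returns, and each iteration adds a small runtime penalty.
    """
    error = sum(1.0 / (1 + n) for n in config)
    runtime_penalty = 0.05 * sum(config)
    undertraining = 1.0 / budget  # short training runs look worse
    return error + runtime_penalty + undertraining


def successive_halving(num_configs=16, num_scales=3, max_iters=4, seed=0):
    """Keep the better half of the candidates each round, doubling the budget."""
    rng = random.Random(seed)
    # Each candidate is a tuple of iteration counts, one entry per scale.
    configs = [
        tuple(rng.randint(1, max_iters) for _ in range(num_scales))
        for _ in range(num_configs)
    ]
    budget = 1
    while len(configs) > 1:
        ranked = sorted(configs, key=lambda c: toy_objective(c, budget))
        configs = ranked[: max(1, len(ranked) // 2)]
        budget *= 2
    return configs[0]


if __name__ == "__main__":
    print("selected iteration counts per scale:", successive_halving())
```

In a real setup the objective would train and validate the network for the given budget, which is the expensive step that early stopping of weak candidates is meant to reduce.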

Abstract

Depth completion is a popular research direction in depth estimation. Fusing color and depth features is the critical challenge in this task, mainly because of the asymmetry between the rich scene details in color images and the sparse pixels in depth maps. To tackle this issue, we design an efficient Gated Cross-Attention Network that propagates confidence via a gating mechanism, simultaneously extracting and refining key information in both the color and depth branches to achieve local spatial feature fusion. Additionally, we employ a Transformer-based attention network in low-dimensional space to fuse global features and enlarge the network's receptive field. With this simple yet efficient gating mechanism, our method achieves fast and accurate depth completion without additional branches or post-processing steps. We also use Ray Tune with the AsyncHyperBandScheduler and the HyperOptSearch algorithm to automatically search for the optimal number of module iterations, which allows us to achieve performance comparable to state-of-the-art methods. We conduct experiments on both indoor and outdoor scene datasets. Our fast network achieves Pareto-optimal solutions in terms of time and accuracy, and at the time of submission, our accurate network ranks first in accuracy among all published papers on the official KITTI website.
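
As a rough illustration of how a gate can let each branch modulate the other, the following NumPy sketch computes a sigmoid gate from the depth features and applies it to the color features, and vice versa. The function name `gated_cross_fusion`, the 1x1-convolution-style weight matrices, and the exact gating formula are assumptions made for this example; the abstract does not give the paper's equations, so this should not be read as the authors' formulation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_cross_fusion(rgb_feat, depth_feat, w_rgb_gate, w_depth_gate):
    """Cross-gate two feature maps of shape (C, H, W).

    w_rgb_gate and w_depth_gate are (C, C) matrices that act like 1x1
    convolutions (hypothetical parameters, not taken from the paper).
    """
    # Gate derived from depth features: values in (0, 1) per channel and
    # pixel, playing the role of a propagated confidence.
    depth_gate = sigmoid(np.einsum("oc,chw->ohw", w_depth_gate, depth_feat))
    # Gate derived from color features, used to modulate the depth branch.
    rgb_gate = sigmoid(np.einsum("oc,chw->ohw", w_rgb_gate, rgb_feat))
    # Each branch is scaled by the gate computed from the other branch.
    rgb_out = rgb_feat * depth_gate
    depth_out = depth_feat * rgb_gate
    return rgb_out, depth_out
```

A call such as `gated_cross_fusion(rgb, depth, w_r, w_d)` with two `(C, H, W)` arrays returns two arrays of the same shape, so the block could in principle be applied repeatedly, which is the quantity the iteration-count search tunes.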

Paper Structure (22 sections, 4 equations, 6 figures, 3 tables, 1 algorithm)

Introduction
Overly rich color image features
Overly sparse depth features
Search for Fusion Counts
Related Work
Concatenation and Channel Shuffle
Confidence (Mask) Propagation
Attention Mechanism
Our Method
Dual-Encoder Single-Decoder Backbone Network
Gated Cross-Attention Mechanism
Transformer Global Attention Mechanism
Iteration Count Search
Loss function
Experiments
...and 7 more sections

Figures (6)

Figure 1: The challenges and obstacles in depth completion. The color image (a) contains overly complex detail features, which can cause methods that rely solely on RGB for generating guiding features to easily introduce irrelevant and incorrect information (d). The input depth map (c) is characterized by its excessively sparse features, often leading to reliance on confidence (mask) for incremental filling and optimization (e), thereby increasing the need for additional branches and computational expense. To address this, we devise a novel Gated Cross-Attention mechanism (f) that merges mask and depth information by propagating confidence features. It also employs co-attention to correct irrelevant details in the color features and to complete the depth features.
Figure 2: The overall architecture of our proposed network. RGB images and sparse depth maps are fed into a dual-branch encoder to extract features separately. The RGB branch generates guiding features, while the depth branch produces probability features for confidence propagation. Subsequently, local pixel fusion is achieved at high resolutions through the gating mechanism, and global fusion is realized at low resolutions through the Transformer, increasing the receptive field. Features from both branches are mutually corrected and completed under the regulation of the co-attention mechanism. The number of fusion iterations is determined through progressive search using Ray Tune. The fused features are gradually upsampled with the depth branch to the original resolution to generate the final dense and precise depth map.
Figure 3: The Transformer's global attention mechanism is implemented by executing self-attention computations at low resolution for global perception and using the multilayer perceptron to further enhance the features.
Figure 4: Schematic diagram of the iteration count search strategy. We employ automated machine learning to find the optimal solution from the vast search space.
Figure 5: Comparative visualization results for official test images from the KITTI Depth Completion benchmark. We select some of the latest representative methods for comparison, including Baseline, Late Fusion, and Spatial Propagation Networks.
...and 1 more figure
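
The global attention described in the Figure 3 caption (self-attention at low resolution followed by a multilayer perceptron) can be sketched in NumPy as follows. This is a minimal single-head version under stated assumptions: the residual connections, the ReLU activation, the omission of layer normalization, and all weight names (`w_q`, `w_k`, `w_v`, `w1`, `w2`) are choices made for this example rather than details confirmed by the source.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def global_attention_block(feat, w_q, w_k, w_v, w1, w2):
    """Apply self-attention and an MLP to a low-resolution map of shape (C, H, W).

    w_q, w_k, w_v: (C, C) projections; w1: (C, D) and w2: (D, C) MLP weights.
    All weights are hypothetical parameters for this sketch.
    """
    c, h, w = feat.shape
    # Flatten spatial positions into H*W tokens, each with C channels.
    tokens = feat.reshape(c, h * w).T
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Every token attends to every other token, giving a global receptive field.
    attn = softmax(q @ k.T / np.sqrt(c), axis=-1)
    tokens = tokens + attn @ v  # residual connection (assumed)
    hidden = np.maximum(tokens @ w1, 0.0)  # ReLU MLP (assumed activation)
    tokens = tokens + hidden @ w2
    # Restore the (C, H, W) layout.
    return tokens.T.reshape(c, h, w)
```

The attention matrix has `(H*W) x (H*W)` entries, so its cost grows quadratically with the number of pixels; this is one reason to run it only at low resolution, as the caption describes.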
