IGAF: Incremental Guided Attention Fusion for Depth Super-Resolution

Athanasios Tragakis, Chaitanya Kaul, Kevin J. Mitchell, Hang Dai, Roderick Murray-Smith, Daniele Faccio

TL;DR

This work tackles guided depth super-resolution by fusing low-resolution depth with high-resolution RGB guidance. It introduces Incremental Guided Attention Fusion (IGAF) and the Filtered Wide-Focus (FWF) feature extractor to perform cross-modal attention-driven fusion across multiple stages, minimizing RGB-induced artifacts. The approach achieves state-of-the-art results on NYU v2 for multiple upsampling factors and demonstrates strong zero-shot generalization to several datasets, supported by public code. Overall, IGAF provides a robust, generalizable solution for high-quality depth maps applicable to robotics, AR/VR, and medical imaging contexts.

Abstract

Accurate depth estimation is crucial for many fields, including robotics, navigation, and medical imaging. However, conventional depth sensors often produce low-resolution (LR) depth maps, making detailed scene perception challenging. To address this, enhancing LR depth maps to high-resolution (HR) ones, guided by HR-structured inputs such as RGB or grayscale images, has become essential. We propose a novel sensor fusion methodology for guided depth super-resolution (GDSR), a technique that combines LR depth maps with HR images to estimate detailed HR depth maps. Our key contribution is the Incremental Guided Attention Fusion (IGAF) module, which effectively learns to fuse features from RGB images and LR depth maps, producing accurate HR depth maps. Using IGAF, we build a robust super-resolution model and evaluate it on multiple benchmark datasets. Our model achieves state-of-the-art results compared to all baseline models on the NYU v2 dataset for $\times 4$, $\times 8$, and $\times 16$ upsampling. It also outperforms all baselines in a zero-shot setting on the Middlebury, Lu, and RGB-D-D datasets. Code, environments, and models are available on GitHub.

Paper Structure (10 sections, 8 equations, 5 figures, 7 tables)


Figures (5)

  • Figure S1: Overview of the proposed multi-modal architecture for guided depth super-resolution.
  • Figure S2: The proposed multi-modal architecture uses information from both an LR depth map and an HR RGB image. First, each modality passes through a convolutional layer followed by a LeakyReLU activation. The model then uses the IGAF modules to combine the two modalities, fusing the relevant information in each stream and ignoring information that is unrelated to the depth maps. Finally, after the third IGAF module, the depth maps are refined and added to the original upsampled LR depth maps through a global skip connection. The RGB modality provides guidance for estimating an HR depth map from an LR one.
  • Figure S3: The $\mathbf{IGAF}$ module. The module is responsible for both feature extraction and modality fusion. Each modality passes through a feature extraction stage $\mathbf{(FWF)}$ before an initial naive fusion by element-wise multiplication. An $\mathbf{SAF}$ block follows, which fuses the result of the multiplication with the extracted features of the RGB stream, creating an initial structural guidance. The second $\mathbf{SAF}$ block incrementally fuses this extracted structural guidance with the depth stream. Each $\mathbf{SAF}$ block generates its output by learning attention weights and then performing a cross-multiplication operation between its two input sequences, yielding fused, salient information.
  • Figure S4: Overview of the $\mathbf{FWF}$ module, which consists of an $\mathbf{FE}$ module and a $\mathbf{WF}$ module. The two modules are kept separate rather than combined into one larger module because propagating shallower features through the skip connections, as seen in Figure S3, boosts the performance of the model. The $\mathbf{FE}$ module is a series of convolutional layers, a channel attention process, and two skip connections. The $\mathbf{WF}$ module uses linearly increasing dilation rates in convolutional layers to extract multi-resolution features.
  • Figure S5: Qualitative comparison between our model and SUFT. The visualizations shown are for the $\times 8$ case. Our model creates more complete depth maps, as seen in (c) for rows 1 and 2. In (c), row 3 shows that our model creates sharper edges with minimal bleeding. Also in (c), row 4 shows that the proposed model produces less smoothing and less bleeding. (Colormap chosen for better visualization. Best viewed in full screen, zoomed in.)
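
The Figure S4 caption notes that the WF module uses linearly increasing dilation rates to extract multi-resolution features. The sketch below is a minimal 1D, pure-Python illustration of that idea and not the authors' implementation. The kernel size of 3, the dilation rates (1, 2, 3), the parallel branches, and the summation of branch outputs are all assumptions made for illustration, and the function names are hypothetical.

```python
def effective_kernel_size(kernel_size, dilation):
    """Number of input samples spanned by one dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d_same(signal, kernel, dilation):
    """Zero-padded 1D cross-correlation whose taps are `dilation` samples apart.

    The output has the same length as the input (odd kernel sizes assumed).
    """
    half = len(kernel) // 2
    n = len(signal)
    out = []
    for center in range(n):
        acc = 0.0
        for i, weight in enumerate(kernel):
            idx = center + (i - half) * dilation
            if 0 <= idx < n:  # samples outside the signal count as zero
                acc += weight * signal[idx]
        out.append(acc)
    return out


def wide_focus_1d(signal, kernel, dilations=(1, 2, 3)):
    """Assumed fusion: run one branch per dilation rate and sum the outputs."""
    branches = [dilated_conv1d_same(signal, kernel, d) for d in dilations]
    return [sum(values) for values in zip(*branches)]


if __name__ == "__main__":
    impulse = [0, 0, 0, 0, 1, 0, 0, 0, 0]
    kernel = [1.0, 1.0, 1.0]
    for d in (1, 2, 3):
        print(d, effective_kernel_size(3, d), dilated_conv1d_same(impulse, kernel, d))
    print(wide_focus_1d(impulse, kernel))
```

With a kernel of size 3, dilation rates 1, 2, and 3 span 3, 5, and 7 input samples respectively, so each branch sees context at a different scale without adding weights. Feeding a unit impulse through the branches makes this visible: each branch spreads the impulse to taps that are 1, 2, or 3 samples away from the center.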