The Devil is in the Details: Boosting Guided Depth Super-Resolution via Rethinking Cross-Modal Alignment and Aggregation

Xinni Jiang, Zengsheng Kuang, Chunle Guo, Ruixun Zhang, Lei Cai, Xiao Fan, Chongyi Li

TL;DR

This work addresses guided depth super-resolution by tackling cross-modal misalignment and texture interference from RGB guidance. The authors introduce D2A2, a two-path approach featuring a Dynamic Dual Alignment (LDA + DGA) to align RGB and depth features both in distribution and geometry, followed by a Mask-to-Pixel Aggregation (GateConv + Pixel Attention) to filter and fuse cross-modal information. The method achieves state-of-the-art or competitive results across multiple benchmarks (NYUv2, Middlebury, Lu, RGBDD) and demonstrates strong generalization under varying conditions. The approach is notable for its explicit handling of modal and geometric differences and its selective, texture-aware fusion strategy, which reduces artifacts while preserving fine depth details, with potential impact on robotics, 3D reconstruction, and AR/VR applications.

Abstract

Guided depth super-resolution (GDSR) involves restoring missing depth details using the high-resolution RGB image of the same scene. Previous approaches have struggled with the heterogeneity and complementarity of the multi-modal inputs, and neglected the issues of modal misalignment, geometrical misalignment, and feature selection. In this study, we rethink some essential components in GDSR networks and propose a simple yet effective Dynamic Dual Alignment and Aggregation network (D2A2). D2A2 mainly consists of 1) a dynamic dual alignment module that adaptively alleviates the modal misalignment via a learnable domain alignment block and geometrically aligns cross-modal features by learning the offset; and 2) a mask-to-pixel feature aggregation module that uses the gated mechanism and pixel attention to filter out irrelevant texture noise from RGB features and combine the useful features with depth features. By combining the strengths of RGB and depth features while minimizing disturbance introduced by the RGB image, our method, with simple reuse and redesign of basic components, achieves state-of-the-art performance on multiple benchmark datasets. The code is available at https://github.com/JiangXinni/D2A2.
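
The filtering idea in the second component can be illustrated with a small NumPy sketch. It is not the authors' implementation (see the repository above for that): it assumes 1x1 convolutions throughout, a sigmoid gate computed from the RGB features alone, a pixel-attention map computed from the gated RGB features concatenated with the depth features, and fusion by simple addition. The names `conv1x1`, `init_params`, and `mask_to_pixel_aggregate` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(x, w, b):
    """Pointwise convolution: x is (C_in, H, W), w is (C_out, C_in), b is (C_out,)."""
    return np.einsum("oc,chw->ohw", w, x) + b[:, None, None]

def init_params(channels, seed=0):
    """Random weights standing in for learned parameters (illustration only)."""
    rng = np.random.default_rng(seed)
    c = channels
    return {
        "w_feat": 0.1 * rng.standard_normal((c, c)), "b_feat": np.zeros(c),
        "w_gate": 0.1 * rng.standard_normal((c, c)), "b_gate": np.zeros(c),
        "w_attn": 0.1 * rng.standard_normal((c, 2 * c)), "b_attn": np.zeros(c),
    }

def mask_to_pixel_aggregate(f_rgb, f_d, p):
    """Gate RGB features, reweight them per pixel, then add them to depth features."""
    # Gated step: a soft mask in (0, 1) decides how much of each RGB response passes.
    feat = conv1x1(f_rgb, p["w_feat"], p["b_feat"])
    gate = sigmoid(conv1x1(f_rgb, p["w_gate"], p["b_gate"]))
    gated = feat * gate
    # Pixel attention (assumed form): a per-pixel, per-channel weight map
    # computed from both modalities.
    attn = sigmoid(conv1x1(np.concatenate([gated, f_d]), p["w_attn"], p["b_attn"]))
    f_masked = gated * attn
    # Fusion by addition is an assumption made for this sketch.
    return f_d + f_masked, gate, attn

rng = np.random.default_rng(0)
f_rgb = rng.standard_normal((4, 6, 5))  # (channels, height, width)
f_d = rng.standard_normal((4, 6, 5))
fused, gate, attn = mask_to_pixel_aggregate(f_rgb, f_d, init_params(4, seed=1))
print(fused.shape, gate.shape, attn.shape)
```

When the gate closes (values near 0), the RGB contribution vanishes and the output falls back to the depth features, which is the behaviour one would want in regions where RGB texture has no counterpart in the depth map.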

Paper Structure

This paper contains 12 sections, 2 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Visual comparison between our D2A2 and the state-of-the-art methods on the Lu dataset for $\times$8 depth super-resolution. We show the enlarged details in the red and green boxes. Our method achieves sharper and clearer boundaries than the compared methods.
  • Figure 2: (a) is an overview of the proposed D2A2 network. The backbone of D2A2 is a multi-scale architecture with global skip-connection. At each scale, $F_d$ and the corresponding $F_{rgb}$ pass through DDA for modal and geometric alignment, then the aligned $F_{rgb}$ and $F_d$ pass through MFA to obtain the effective feature $F_{masked}$ and fuse it with $F_d$. (b) and (c) are the specific structures of DDA and MFA, respectively.
  • Figure 3: Visualization of RGB features before and after the dynamic dual alignment module (DDA) as well as the mask of the gated convolution (GC) and the weight map of pixel attention (PA) in the mask-to-pixel aggregation module.
  • Figure 4: Histograms of the input RGB feature and of the RGB feature after LDA, alongside the histogram of the corresponding depth feature. After LDA, the distribution of the RGB feature lies closer to that of the depth feature.
  • Figure 5: Visual comparison of different methods on the Middlebury dataset for $\times$8 depth super-resolution.
  • ...and 4 more figures
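
The distribution shift described in the Figure 4 caption can be illustrated with a generic statistic-matching sketch. This is not the paper's LDA block, whose internals are not given here. It is a stand-in in the spirit of adaptive instance normalization: each RGB feature channel is normalized and then rescaled to the mean and standard deviation of the matching depth channel. The function name `align_channel_stats` and the blending factor `alpha` are hypothetical; in a learned block such a factor would be a trainable parameter rather than a constant.

```python
import numpy as np

def align_channel_stats(f_rgb, f_d, alpha=1.0, eps=1e-5):
    """Move per-channel mean/std of f_rgb toward those of f_d.

    f_rgb, f_d: arrays of shape (C, H, W).
    alpha: 0 keeps f_rgb unchanged, 1 fully matches the depth statistics.
    """
    mu_r = f_rgb.mean(axis=(1, 2), keepdims=True)
    sd_r = f_rgb.std(axis=(1, 2), keepdims=True)
    mu_d = f_d.mean(axis=(1, 2), keepdims=True)
    sd_d = f_d.std(axis=(1, 2), keepdims=True)
    # Normalize each RGB channel, then rescale with the depth statistics.
    matched = (f_rgb - mu_r) / (sd_r + eps) * sd_d + mu_d
    return alpha * matched + (1.0 - alpha) * f_rgb

rng = np.random.default_rng(0)
f_rgb = 3.0 * rng.standard_normal((4, 16, 16)) + 2.0  # wide, shifted distribution
f_d = 0.5 * rng.standard_normal((4, 16, 16)) - 1.0    # narrow distribution
aligned = align_channel_stats(f_rgb, f_d)
print(aligned.mean(axis=(1, 2)), f_d.mean(axis=(1, 2)))
```

After the call, a histogram of `aligned` would sit over the histogram of `f_d` rather than over that of `f_rgb`, which is the qualitative effect the caption reports for LDA.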