The Devil is in the Details: Boosting Guided Depth Super-Resolution via Rethinking Cross-Modal Alignment and Aggregation
Xinni Jiang, Zengsheng Kuang, Chunle Guo, Ruixun Zhang, Lei Cai, Xiao Fan, Chongyi Li
TL;DR
This work addresses guided depth super-resolution by tackling cross-modal misalignment and texture interference from RGB guidance. The authors introduce D2A2, a two-stage network. A Dynamic Dual Alignment module (LDA + DGA) first aligns RGB and depth features in both distribution and geometry; a Mask-to-Pixel Aggregation module (GateConv + Pixel Attention) then filters the RGB features and fuses them with the depth features. The method achieves state-of-the-art or competitive results across multiple benchmarks (NYUv2, Middlebury, Lu, RGBDD) and generalizes well under varying conditions. Its distinguishing features are the explicit handling of modal and geometric differences and a selective, texture-aware fusion strategy that reduces artifacts while preserving fine depth details. Potential applications include robotics, 3D reconstruction, and AR/VR.
Abstract
Guided depth super-resolution (GDSR) involves restoring missing depth details using the high-resolution RGB image of the same scene. Previous approaches have struggled with the heterogeneity and complementarity of the multi-modal inputs, and have neglected the issues of modal misalignment, geometrical misalignment, and feature selection. In this study, we rethink some essential components in GDSR networks and propose a simple yet effective Dynamic Dual Alignment and Aggregation network (D2A2). D2A2 mainly consists of 1) a dynamic dual alignment module that alleviates the modal misalignment via a learnable domain alignment block and geometrically aligns cross-modal features by learning the offset; and 2) a mask-to-pixel feature aggregation module that uses a gated mechanism and pixel attention to filter out irrelevant texture noise from RGB features and combine the useful features with depth features. By combining the strengths of RGB and depth features while minimizing the disturbance introduced by the RGB image, our method, which simply reuses and redesigns basic components, achieves state-of-the-art performance on multiple benchmark datasets. The code is available at https://github.com/JiangXinni/D2A2.
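To make the "filter, then fuse" idea concrete, here is a toy sketch of gated filtering followed by per-pixel attention. It is not the authors' implementation: the function name `gated_pixel_fusion`, the scalar weights, the single-channel nested-list feature maps, and the residual form of the fusion are all assumptions chosen for illustration. The actual module operates on learned multi-channel convolutional features (see the linked repository).

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_pixel_fusion(depth_feat, rgb_feat,
                       gate_w=1.0, gate_b=0.0,
                       attn_w=(1.0, 1.0), attn_b=0.0):
    """Fuse two single-channel H x W feature maps given as nested lists.

    All weights are hypothetical scalars standing in for learned
    convolution kernels; this is an illustrative sketch only.
    """
    fused = []
    for d_row, r_row in zip(depth_feat, rgb_feat):
        out_row = []
        for d, r in zip(d_row, r_row):
            # Gating step: a soft mask in (0, 1) scales the RGB feature,
            # so a gate near 0 suppresses that pixel's RGB response.
            gate = sigmoid(gate_w * r + gate_b)
            r_filtered = gate * r
            # Pixel attention step: a per-pixel weight computed from both
            # modalities decides how much filtered RGB is added to depth.
            attn = sigmoid(attn_w[0] * d + attn_w[1] * r_filtered + attn_b)
            out_row.append(d + attn * r_filtered)
        fused.append(out_row)
    return fused
```

With a strongly negative `gate_b`, the gate closes and the output stays close to the depth features, which is the intended behavior when RGB texture carries no depth-relevant structure.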
