Multi-Order Matching Network for Alignment-Free Depth Super-Resolution

Zhengxue Wang, Zhiqiang Yan, Yuan Wu, Guangwei Gao, Xiang Li, Jian Yang

TL;DR

Real-world RGB-D capture often suffers from misalignment between the RGB guidance and the depth map, which limits conventional RGB-guided depth super-resolution (DSR). The paper introduces MOMNet, an alignment-free framework that jointly performs zero-order, first-order, and second-order matching (ZOM, FOM, SOM) to retrieve depth-relevant RGB information across a multi-order feature space. The retrieved features are fused by a multi-order aggregation built from structure detectors. Training is constrained in the multi-order space by a multi-order regularization with total loss $\mathcal{L}_{total}=\mathcal{L}_{rec}+\mathcal{L}_{grad}+\alpha\,\mathcal{L}_{hes}$. Evaluated on Hypersim, DIML, and DyDToF, MOMNet achieves state-of-the-art accuracy and robustness at $\times4$, $\times8$, and $\times16$ scales. A lightweight variant, MOMNet-T, substantially reduces the parameter count while remaining competitive. The approach advances alignment-free cross-modal fusion for depth enhancement, showing improved resilience to misalignment and noise and enabling practical deployment on consumer-grade sensors.
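
The multi-order regularization can be illustrated with a small NumPy sketch. It assumes L1 penalties, finite-difference gradient and Hessian operators, and a hypothetical default `alpha=1.0`. The paper's exact operators, norms, and weight are not given in this summary, and the function names are illustrative only.

```python
import numpy as np

# Sketch of a multi-order regularization. ASSUMPTIONS (not from the paper):
# L1 penalties, finite-difference operators, and the weight `alpha`.

def gradients(d):
    """First-order finite differences of a 2-D depth map."""
    gx = d[:, 1:] - d[:, :-1]  # horizontal, shape (H, W-1)
    gy = d[1:, :] - d[:-1, :]  # vertical, shape (H-1, W)
    return gx, gy

def hessian(d):
    """Second-order finite differences: d_xx, d_yy and the mixed term d_xy."""
    dxx = d[:, 2:] - 2 * d[:, 1:-1] + d[:, :-2]
    dyy = d[2:, :] - 2 * d[1:-1, :] + d[:-2, :]
    dxy = d[1:, 1:] - d[1:, :-1] - d[:-1, 1:] + d[:-1, :-1]
    return dxx, dyy, dxy

def l1(a, b):
    """Mean absolute difference between two arrays of equal shape."""
    return float(np.mean(np.abs(a - b)))

def multi_order_loss(pred, gt, alpha=1.0):
    """L_total = L_rec + L_grad + alpha * L_hes (hypothetical L1 version)."""
    l_rec = l1(pred, gt)
    l_grad = sum(l1(p, g) for p, g in zip(gradients(pred), gradients(gt)))
    l_hes = sum(l1(p, g) for p, g in zip(hessian(pred), hessian(gt)))
    return l_rec + l_grad + alpha * l_hes

# Usage: a constant offset is seen only by the reconstruction term.
gt = np.arange(16, dtype=float).reshape(4, 4)
print(multi_order_loss(gt + 0.5, gt))  # 0.5
```

In this sketch, a constant depth offset is penalized only by $\mathcal{L}_{rec}$ because it cancels in every difference. An isolated spike, by contrast, is also penalized by the gradient and Hessian terms. This is one way to read "constraining learning in a multi-order space."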

Abstract

Recent guided depth super-resolution methods assume strict spatial alignment between depth and RGB, and under that assumption they achieve high-quality depth reconstruction. In real-world scenarios, however, strictly aligned RGB-D pairs are hard to acquire because of inherent hardware limitations (e.g., physically separate RGB and depth sensors) and unavoidable calibration drift caused by mechanical vibration or temperature variation. Consequently, existing approaches often degrade noticeably when applied to misaligned real-world scenes. In this paper, we propose the Multi-Order Matching Network (MOMNet), a novel alignment-free framework that adaptively retrieves and selects the most relevant information from misaligned RGB. Specifically, our method begins with a multi-order matching mechanism that jointly performs zero-order, first-order, and second-order matching to identify RGB information consistent with depth across multi-order feature spaces. To effectively integrate the retrieved RGB and depth features, we further introduce a multi-order aggregation composed of multiple structure detectors. This strategy uses multi-order priors as prompts to facilitate selective feature transfer from RGB to depth. Extensive experiments demonstrate that MOMNet achieves state-of-the-art performance and strong robustness to misalignment.
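
Here is a minimal NumPy sketch of zero-, first-, and second-order matching. It is not the authors' implementation. The order descriptors (finite-difference gradients and second derivatives of the features), cosine similarity, and hard argmax retrieval are all assumptions. `order_features`, `match`, and `multi_order_match` are hypothetical names. Both inputs are (C, H, W) feature maps with the same channel count.

```python
import numpy as np

# Sketch of multi-order matching with hard top-1 retrieval. ASSUMPTIONS:
# finite-difference order descriptors, cosine similarity, and argmax
# selection; the paper's matching retrieval (MR) may differ.

def order_features(f, order):
    """Order-k descriptor of a (C, H, W) feature map (hypothetical)."""
    if order == 0:
        return f                                  # zero order: raw features
    gy, gx = np.gradient(f, axis=(1, 2))          # first order: spatial gradients
    if order == 1:
        return np.concatenate([gx, gy], axis=0)   # (2C, H, W)
    gxx = np.gradient(gx, axis=2)                 # second order: d_xx and d_yy
    gyy = np.gradient(gy, axis=1)
    return np.concatenate([gxx, gyy], axis=0)     # (2C, H, W)

def match(depth_feat, rgb_feat, order):
    """For each depth position, fetch the RGB feature whose order-k
    descriptor has the highest cosine similarity."""
    c, h, w = rgb_feat.shape
    q = order_features(depth_feat, order).reshape(-1, h * w)  # (C', N)
    k = order_features(rgb_feat, order).reshape(-1, h * w)
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-8)
    k = k / (np.linalg.norm(k, axis=0, keepdims=True) + 1e-8)
    idx = (q.T @ k).argmax(axis=1)  # best RGB position per depth position
    retrieved = rgb_feat.reshape(c, -1)[:, idx].reshape(c, h, w)
    return retrieved, idx

def multi_order_match(depth_feat, rgb_feat):
    """ZOM, FOM and SOM in one call: one retrieved RGB map per order."""
    return [match(depth_feat, rgb_feat, k)[0] for k in (0, 1, 2)]
```

Only the matching criterion changes with the order. What is gathered is always the zero-order RGB feature at the matched position. The dense similarity matrix is N × N for N = H·W pixels, so a practical version would likely need a restricted search window or downsampled features.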

Paper Structure

This paper contains 15 sections, 11 equations, 11 figures, 3 tables, and 1 algorithm.

Figures (11)

  • Figure 1: Previous methods (a) assume spatial alignment between RGB and depth data. In contrast, our approach (b) addresses the misalignment present in real-world scenarios through multi-order matching, enabling alignment-free DSR.
  • Figure 2: Visualization of Gradient (Grad.) and Hessian (Hes.) maps for (a) RGB and (b) depth. (c) and (d) compare their corresponding distributions.
  • Figure 3: Overview of MOMNet. Given LR depth $\boldsymbol D_{LR}$ and RGB $\boldsymbol I$ as inputs, we first encode them into features $\boldsymbol F_{d}^{0}$ and $\boldsymbol F_{r}^{0}$, respectively. The Multi-Order Matching and Aggregation (MOMA) module is then applied iteratively to retrieve and aggregate depth-relevant information from the misaligned RGB features, yielding the predicted HR depth $\boldsymbol D_{HR}$. Finally, both $\boldsymbol D_{HR}$ and the ground-truth (GT) depth $\boldsymbol D_{GT}$ are fed into the Multi-Order Regularization to optimize MOMNet.
  • Figure 4: Details of multi-order matching (left) and matching retrieval (MR, middle). Right: histogram comparison of (a) original RGB, (b) zero-order matched (ZOM) RGB, (c) first-order matched (FOM) RGB, and (d) second-order matched (SOM) RGB.
  • Figure 5: Details of multi-order aggregation. $\sigma$: Sigmoid Layer.
  • ...and 6 more figures
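
Figure 5 shows the multi-order aggregation as structure detectors ending in a sigmoid ($\sigma$). The NumPy sketch below is one hedged reading of that design, not the paper's module. Each hypothetical detector is a single 1×1 convolution over the concatenated depth and retrieved RGB features, loosely treating that concatenation as the "multi-order prior" prompt. Its sigmoid output gates how much of that order's retrieved RGB is added to the depth features. The names, weight shapes, and additive fusion are assumptions.

```python
import numpy as np

# Sketch of a gated multi-order aggregation. ASSUMPTIONS: each "structure
# detector" is a 1x1 convolution followed by a sigmoid gate (the sigma in
# Figure 5); weight shapes and the additive fusion are hypothetical.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(x, weight):
    """1x1 convolution: (C_in, H, W) map, weight of shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def multi_order_aggregate(depth_feat, retrieved, weights):
    """Fuse retrieved RGB maps (one per order) into the depth features.
    Each detector sees [depth; retrieved_k] and emits a per-pixel,
    per-channel gate in (0, 1) deciding how much of retrieved_k to add."""
    fused = depth_feat.copy()
    for rgb_k, w_k in zip(retrieved, weights):
        prompt = np.concatenate([depth_feat, rgb_k], axis=0)  # (2C, H, W)
        gate = sigmoid(conv1x1(prompt, w_k))                  # (C, H, W)
        fused = fused + gate * rgb_k                          # selective transfer
    return fused
```

With all-zero weights, every gate equals 0.5, so each order contributes half of its retrieved features. Strongly negative weights close the gates and leave the depth features essentially unchanged.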