S2ML: Spatio-Spectral Mutual Learning for Depth Completion

Zihui Zhao, Yifei Zhang, Zheng Wang, Yang Li, Kui Jiang, Zihan Geng, Chia-Wen Lin

TL;DR

S2ML tackles depth completion through spatio-spectral mutual learning, which exploits frequency-domain priors in raw depth images. It introduces a spectral fusion module that treats amplitude and phase spectra separately, and a spatial fusion module that combines frequency-domain features with local and global context via Swin-Convolution blocks. The method progressively refines depth predictions through cascaded spatio-spectral fusion pairs trained with a joint loss. It achieves state-of-the-art results on NYU-Depth V2 and SUN RGB-D and remains robust under RGB degradations and in outdoor-like scenarios. By exploiting the physical priors of invalid depth regions together with the complementary information in RGB data, the approach offers a practical, efficient path to high-quality depth maps for downstream vision tasks.

Abstract

The raw depth images captured by RGB-D cameras using Time-of-Flight (ToF) or structured light often suffer from incomplete depth values due to weak reflections, boundary shadows, and artifacts, which limits their use in downstream vision tasks. Existing methods address this problem through depth completion in the image domain, but they overlook the physical characteristics of raw depth images. It has been observed that the presence of invalid depth areas alters the frequency distribution pattern. In this work, we propose a Spatio-Spectral Mutual Learning framework (S2ML) to harmonize the advantages of both spatial and frequency domains for depth completion. Specifically, we consider the distinct properties of amplitude and phase spectra and devise a dedicated spectral fusion module. Meanwhile, the local and global correlations between spatial-domain and frequency-domain features are calculated in a unified embedding space. The gradual mutual representation and refinement encourage the network to fully explore complementary physical characteristics and priors for more accurate depth completion. Extensive experiments demonstrate the effectiveness of our proposed S2ML method, which outperforms the state-of-the-art method CFormer by 0.828 dB and 0.834 dB on the NYU-Depth V2 and SUN RGB-D datasets, respectively.
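The observation that invalid depth areas alter the frequency distribution can be illustrated with a small NumPy sketch. It is not taken from the paper: the synthetic ramp "depth map", the rectangular hole, and the helper names `amplitude_phase` and `from_amplitude_phase` are all assumptions made for illustration. It splits an image into amplitude and phase spectra with a 2-D DFT, checks that the two spectra together reconstruct the image, and measures how much zeroing a region changes the amplitude spectrum.

```python
import numpy as np

def amplitude_phase(img):
    """Return the amplitude and phase spectra of a 2-D array (2-D DFT)."""
    spectrum = np.fft.fft2(img)
    return np.abs(spectrum), np.angle(spectrum)

def from_amplitude_phase(amplitude, phase):
    """Rebuild a real-valued image from amplitude and phase spectra (inverse DFT)."""
    return np.fft.ifft2(amplitude * np.exp(1j * phase)).real

# Hypothetical "ground truth" depth: a smooth ramp with arbitrary values.
h, w = 64, 64
yy, xx = np.mgrid[0:h, 0:w]
depth_gt = 1.0 + 0.02 * xx + 0.01 * yy

# Hypothetical raw depth: a rectangular hole set to 0 stands in for an invalid area.
depth_raw = depth_gt.copy()
depth_raw[20:40, 24:44] = 0.0

amp_gt, pha_gt = amplitude_phase(depth_gt)
amp_raw, pha_raw = amplitude_phase(depth_raw)

# Amplitude and phase together carry the full image content.
recon = from_amplitude_phase(amp_raw, pha_raw)

# Mean absolute change in the amplitude spectrum caused by the hole.
amp_change = float(np.abs(amp_gt - amp_raw).mean())
print("amplitude spectrum changed by", amp_change, "on average")
```

Real sensor holes are irregular rather than rectangular, so this only shows the qualitative effect: removing depth values in the spatial domain shifts energy across the whole spectrum.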

Paper Structure

This paper contains 18 sections, 16 equations, 10 figures, and 5 tables.

Figures (10)

  • Figure 1: Phase and amplitude spectra of the ground truth depth, raw depth, RGB image, and our predicted depth image.
  • Figure 2: Overview of our S2ML method. The inputs $D^\mathrm{raw}$ and $I$ are first embedded into feature representations. These features then undergo a recursive fusion process through a series of spatio-spectral fusion pairs. DFT and IDFT are applied to enable information interaction between the frequency domain and the spatial domain. After frequency fusion, the fused frequency features are passed directly to the next frequency fusion module and, in parallel, transformed into the spatial domain to be processed by the spatial fusion module.
  • Figure 3: Structure of our proposed frequency fusion module, which fuses the amplitude and phase spectra of the two modalities using distinct fusion strategies. To emphasize the differential information present in the frequency domain, the module extracts spectrum difference feature maps between the depth spectrum and the RGB spectrum and adds a residual connection from the depth spectrum. This guides the network to prioritize spectral discrepancies during the fusion process.
  • Figure 4: Structure of our proposed image fusion module. This module combines convolutional layers and a Swin-Transformer architecture to extract both global and local features from the input images. The Swin-Transformer excels at capturing long-range dependencies, while convolutional layers handle local details. The details of window partitioning and merging within the Swin-Transformer are omitted for brevity.
  • Figure 5: Visualization of depth feature maps: (a) Ground truth, (b) Raw depth map, (c) Residual depth feature map of a single frequency fusion module, (d) Residual depth feature map of a spatio-spectral fusion pair. The invalid areas and corresponding residuals within the red rectangles highlight the contribution of each module to the depth completion process.
  • ...and 5 more figures
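
The data flow described in the captions of Figures 2 and 3 (DFT, separate handling of amplitude and phase, a difference term with a residual from the depth spectrum, then IDFT back to the spatial domain) can be sketched in NumPy. This is a toy stand-in under stated assumptions, not the authors' module: the real fusion is learned, whereas here a fixed scalar `alpha` (hypothetical) replaces the learned layers, the inputs are single-channel 2-D arrays, and the function name `toy_spectral_fusion` is invented for illustration.

```python
import numpy as np

def toy_spectral_fusion(depth_feat, rgb_feat, alpha=0.5):
    """Fuse two 2-D feature maps in the frequency domain.

    alpha is a hypothetical fixed weight standing in for learned fusion layers.
    Returns the fused complex spectrum and its spatial-domain counterpart.
    """
    d_spec = np.fft.fft2(depth_feat)
    r_spec = np.fft.fft2(rgb_feat)
    d_amp, d_pha = np.abs(d_spec), np.angle(d_spec)
    r_amp, r_pha = np.abs(r_spec), np.angle(r_spec)

    # Amplitude: weighted spectrum difference plus a residual from the depth spectrum.
    fused_amp = d_amp + alpha * (r_amp - d_amp)

    # Phase: blend on the unit circle so that wrap-around at +/- pi is handled.
    fused_pha = np.angle((1.0 - alpha) * np.exp(1j * d_pha)
                         + alpha * np.exp(1j * r_pha))

    fused_spec = fused_amp * np.exp(1j * fused_pha)
    # IDFT returns the fused features to the spatial domain.
    fused_spatial = np.fft.ifft2(fused_spec).real
    return fused_spec, fused_spatial

# Usage with random stand-in features (not real depth or RGB data).
rng = np.random.default_rng(0)
depth_feat = rng.random((32, 32))
rgb_feat = rng.random((32, 32))

fused_spec, fused_spatial = toy_spectral_fusion(depth_feat, rgb_feat, alpha=0.5)
_, only_depth = toy_spectral_fusion(depth_feat, rgb_feat, alpha=0.0)
_, only_rgb = toy_spectral_fusion(depth_feat, rgb_feat, alpha=1.0)
print(fused_spec.shape, fused_spatial.shape)
```

With `alpha=0.0` the sketch passes the depth features through unchanged, and with `alpha=1.0` it returns the RGB features, which makes the role of the residual and difference terms easy to check. In a cascaded setup like the one Figure 2 describes, `fused_spec` would feed the next frequency fusion step while `fused_spatial` would go to a spatial fusion step.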