SVDC: Consistent Direct Time-of-Flight Video Depth Completion with Frequency Selective Fusion

Xuan Zhu, Jijun Xiang, Xianqi Wang, Longliang Liu, Yu Wang, Hong Zhang, Fei Guo, Xin Yang

TL;DR

SVDC tackles the problem of sparse, noisy dToF depth video on mobile devices by fusing sparse ToF depth with RGB guidance across three-frame windows. It introduces Channel-Spatial Enhancement Attention (CSEA) to identify high-frequency edge regions and Adaptive Frequency Selective Fusion (AFSF) to fuse frames with region-aware kernel sizes, along with a cross-window temporal consistency loss to suppress flicker. The method achieves state-of-the-art accuracy and temporal consistency on the TartanAir and Dynamic Replica datasets with a lightweight model. This approach enables robust, edge-preserving depth completion for mobile 3D sensing and AR applications.

Abstract

Lightweight direct Time-of-Flight (dToF) sensors are ideal for 3D sensing on mobile devices. However, due to the manufacturing constraints of compact devices and the inherent physical principles of imaging, dToF depth maps are sparse and noisy. In this paper, we propose a novel video depth completion method, called SVDC, that fuses the sparse dToF data with the corresponding RGB guidance. Our method employs a multi-frame fusion scheme to mitigate the spatial ambiguity resulting from sparse dToF imaging. Misalignment between consecutive frames during multi-frame fusion can blend object edges with the background, which results in a loss of detail. To address this, we introduce an adaptive frequency selective fusion (AFSF) module, which automatically selects convolution kernel sizes to fuse multi-frame features. Our AFSF utilizes a channel-spatial enhancement attention (CSEA) module to enhance features and generate an attention map that serves as fusion weights. The AFSF ensures edge detail recovery while suppressing high-frequency noise in smooth regions. To further enhance temporal consistency, we propose a cross-window consistency loss to ensure consistent predictions across different windows, effectively reducing flickering. Our proposed SVDC achieves state-of-the-art accuracy and consistency on the TartanAir and Dynamic Replica datasets. Code is available at https://github.com/Lan1eve/SVDC.
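
The fusion idea described in the abstract can be illustrated with a toy sketch. This is not the authors' AFSF implementation: it assumes 1D feature rows instead of 2D feature maps, uses fixed mean filters in place of learned convolutions, averages frames as a stand-in for multi-frame fusion, and takes the attention map as a given input rather than computing it with CSEA. The names `box_filter` and `selective_fuse` are hypothetical. It only shows how an attention weight near 1 (edge) can favor a small kernel while a weight near 0 (smooth region) favors a large one.

```python
from typing import List


def box_filter(signal: List[float], kernel_size: int) -> List[float]:
    """Mean filter with an odd kernel size; borders are handled by clamping indices."""
    half = kernel_size // 2
    n = len(signal)
    out = []
    for i in range(n):
        window = [signal[min(max(j, 0), n - 1)] for j in range(i - half, i + half + 1)]
        out.append(sum(window) / len(window))
    return out


def selective_fuse(
    frames: List[List[float]],
    attention: List[float],
    small_kernel: int = 1,
    large_kernel: int = 5,
) -> List[float]:
    """Toy frequency-selective fusion (an assumption, not the paper's module).

    1. Average the aligned frames position by position (stand-in for temporal fusion).
    2. Filter the result with a small and a large kernel.
    3. Blend per position: attention near 1 keeps the small-kernel (sharp) branch,
       attention near 0 keeps the large-kernel (smooth) branch.
    """
    num_frames = len(frames)
    mean = [sum(values) / num_frames for values in zip(*frames)]
    sharp = box_filter(mean, small_kernel)
    smooth = box_filter(mean, large_kernel)
    return [a * s + (1.0 - a) * m for a, s, m in zip(attention, sharp, smooth)]


if __name__ == "__main__":
    # Three identical frames containing a step edge between index 2 and 3.
    frames = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0]] * 3
    edge_attention = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]  # hypothetical CSEA-like output
    print(selective_fuse(frames, edge_attention))
```

In this sketch the step edge survives wherever the attention weight is high, while positions with low attention are smoothed by the larger kernel.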

Paper Structure

This paper contains 23 sections, 12 equations, 12 figures, 9 tables.

Figures (12)

  • Figure 1: Comparisons with state-of-the-art (SOTA) methods on the TartanAir and Dynamic Replica datasets. Left: Accuracy metric RMSE$\downarrow$. Right: Temporal consistency metric OPW$\downarrow$ [wangLessMoreConsistent2022]. Our proposed approach achieves superior accuracy and consistency compared to per-frame depth completion methods.
  • Figure 2: Overview of the proposed SVDC network. The CSEA module enhances multi-frame features and extracts attention maps to guide the AFSF module in selectively fusing multi-frame features. Finally, the low-resolution depth is obtained through the depth head and refined using the feature-guided pixel shuffle module to produce the final depth.
  • Figure 3: The proposed CSEA and AFSF architectures. Left: CSEA module. Right: AFSF module.
  • Figure 4: The supervision process of the Cross-Window Temporal Consistency Loss.
  • Figure 5: Qualitative results on TartanAir and Dynamic Replica. Row 1: Results on the TartanAir dataset. Row 2: Results on the Dynamic Replica dataset. The third column represents the attention maps extracted by the CSEA module. Our SVDC method outperforms DVDC in both edge prediction and the prediction of smooth regions.
  • ...and 7 more figures
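
The cross-window consistency idea named in Figure 4 can be sketched as follows. This is an illustration under stated assumptions, not the paper's loss: it assumes that neighbouring three-frame windows overlap, that each window yields one depth prediction per frame it covers, and that disagreement is measured with a plain mean absolute difference. The function name `cross_window_l1` and the dictionary layout (frame index mapped to a flattened depth map) are hypothetical.

```python
from typing import Dict, List

Depth = List[float]  # flattened depth map for one frame


def cross_window_l1(pred_a: Dict[int, Depth], pred_b: Dict[int, Depth]) -> float:
    """Mean absolute difference over the frames that both windows predicted.

    pred_a and pred_b map a frame index to the depth predicted for that frame
    by window A and window B. Frames covered by only one window are ignored.
    """
    shared = sorted(set(pred_a) & set(pred_b))
    total, count = 0.0, 0
    for t in shared:
        for depth_a, depth_b in zip(pred_a[t], pred_b[t]):
            total += abs(depth_a - depth_b)
            count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    # Window A covers frames 0-2, window B covers frames 1-3 (assumed overlap).
    window_a = {0: [1.0, 1.0], 1: [2.0, 2.0], 2: [3.0, 3.0]}
    window_b = {1: [2.0, 2.5], 2: [3.0, 3.0], 3: [4.0, 4.0]}
    print(cross_window_l1(window_a, window_b))
```

Penalizing this quantity during training pushes the two windows toward the same depth for each shared frame, which is one plausible way to reduce flicker between windows.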