Self-supervised Depth Denoising Using Lower- and Higher-quality RGB-D sensors

Akhmedkhan Shabanov, Ilya Krotov, Nikolay Chinaev, Vsevolod Poletaev, Sergei Kozlukov, Igor Pasechnik, Bulat Yakupov, Artsiom Sanakoyeu, Vadim Lebedev, Dmitry Ulyanov

TL;DR

This work tackles the problem of denoising depth from consumer-level RGB-D sensors by leveraging a higher-quality RGB-D source captured simultaneously but not time-synchronized. The authors develop a self-supervised pipeline that (i) aligns lower- and higher-quality sensors both temporally and spatially, (ii) trains a UNet-based depth denoiser followed by a ConvLSTM to exploit temporal context, and (iii) uses a simple L1 loss against reprojected HQ depth with pixel and segmentation masks. The approach yields a substantial improvement over state-of-the-art filtering and data-driven methods (e.g., achieving an average MSE as low as $21.02$ mm in their tests) and enables higher-quality 3D surface reconstruction for both dynamic and static body scenes, demonstrated with DoubleFusion and InfiniTAM/KinectFusion pipelines. By removing the need for precise ground-truth, hardware synchronization, or extensive calibration, this method makes dense, accurate depth reconstruction feasible for mobile or embedded depth sensors in real-world settings.
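The masked L1 objective mentioned above can be illustrated with a minimal sketch. It assumes the two masks are combined with a logical AND and that the loss is averaged over the pixels that remain valid; the summary does not state either detail, and the function name `masked_l1_loss` is hypothetical. Plain nested lists are used to keep the example dependency-free; a real pipeline would more likely use an array library.

```python
from typing import Sequence

Image = Sequence[Sequence[float]]   # depth values, e.g. in millimetres
Mask = Sequence[Sequence[bool]]     # True where a pixel may contribute


def masked_l1_loss(pred: Image, target: Image,
                   pixel_mask: Mask, seg_mask: Mask) -> float:
    """Mean absolute error over pixels valid in BOTH masks.

    Assumption: masks are ANDed and the sum is normalised by the
    number of valid pixels; the paper may normalise differently.
    """
    total = 0.0
    count = 0
    for p_row, t_row, pm_row, sm_row in zip(pred, target, pixel_mask, seg_mask):
        for p, t, pm, sm in zip(p_row, t_row, pm_row, sm_row):
            if pm and sm:
                total += abs(p - t)
                count += 1
    # With no valid pixels there is nothing to penalise.
    return total / count if count else 0.0


# Toy 2x2 example: denoised LQ depth vs. reprojected HQ depth.
pred = [[1000.0, 1010.0], [990.0, 0.0]]
target = [[1002.0, 1000.0], [990.0, 500.0]]
pixel_mask = [[True, True], [True, False]]   # e.g. HQ depth is defined here
seg_mask = [[True, False], [True, True]]     # e.g. pixel belongs to the subject
loss = masked_l1_loss(pred, target, pixel_mask, seg_mask)
```

Only the two pixels valid in both masks contribute, so the large error at the invalid bottom-right pixel does not affect the loss.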

Abstract

Consumer-level depth cameras and depth sensors embedded in mobile devices enable numerous applications, such as AR games and face identification. However, the quality of the captured depth is sometimes insufficient for 3D reconstruction, tracking and other computer vision tasks. In this paper, we propose a self-supervised depth denoising approach to denoise and refine depth coming from a low-quality sensor. We record simultaneous RGB-D sequences with unsynchronized lower- and higher-quality cameras and solve the challenging problem of aligning the sequences both temporally and spatially. We then train a deep neural network to denoise the lower-quality depth, using the matched higher-quality data as the supervision signal. We experimentally validate our method against state-of-the-art filtering-based and deep denoising techniques and show its application to 3D object reconstruction tasks, where our approach leads to more detailed fused surfaces and better tracking.

Paper Structure

This paper contains 15 sections, 6 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: We use lower- and higher-quality depth sensors (a) to record simultaneous RGB-D sequences. Our automatic data collection pipeline allows us to train a model for denoising the data coming from the lower-quality (LQ) sensor (b). In figure (c) we show a result produced by our method for the input (b).
  • Figure 2: Summary of the data processing pipeline. See the dataset section for details.
  • Figure 3: Out-of-fold prediction scheme. We split all sequences into three groups $\{ P_1, P_2, P_{test} \}$. We then learn three models $M^1_1$, $M^1_2$, $M^1$ on the group shown in red and get their predictions on the blue parts. For example, the first model is trained on the part $P_1$ and evaluated on the part $P_2$. We then use the predictions $M^1_1(P_2)$ and $M^1_2(P_1)$ to train a second-level model $M^2$ while using $M^1(P_{test})$ for its validation.
  • Figure 4: Qualitative comparison of depth denoising methods. Please see a video in supplementary materials.
  • Figure 6: Canonical models built by DoubleFusion for different input data. DoubleFusion is designed to work with K2 input (e) and cannot handle noisy TD depth (a). Our method greatly improves the result of DoubleFusion (b, c), enabling its usage in mobile applications. We compare the K2 mesh (e) with meshes (a, b, c) by computing the distance between the surfaces and visualize the result as heatmaps. Blue corresponds to zero error, while red corresponds to an error of 44 mm.
  • ...and 3 more figures
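
The out-of-fold scheme described in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the caption does not say which data $M^1$ is trained on, so the sketch assumes $P_1 \cup P_2$; the "model" is a toy stand-in that learns a constant depth offset; and the names `train_offset_model` and `out_of_fold` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, float]      # (input depth, HQ target depth), toy stand-in for a frame pair
Model = Callable[[float], float]


def train_offset_model(samples: Sequence[Sample]) -> Model:
    """Toy 'denoiser': learns the mean (target - input) offset."""
    offset = sum(hq - x for x, hq in samples) / len(samples)
    return lambda x: x + offset


def out_of_fold(p1: Sequence[Sample], p2: Sequence[Sample],
                p_test: Sequence[Sample],
                train: Callable[[Sequence[Sample]], Model] = train_offset_model):
    # First-level models: each one predicts only on data it never saw.
    m1_1 = train(p1)                      # trained on P1, applied to P2
    m1_2 = train(p2)                      # trained on P2, applied to P1
    m1 = train(list(p1) + list(p2))       # ASSUMPTION: trained on P1 and P2 together

    # Second-level training set: out-of-fold predictions paired with HQ targets.
    level2_train: List[Sample] = (
        [(m1_1(x), hq) for x, hq in p2] + [(m1_2(x), hq) for x, hq in p1]
    )
    # Second-level validation set: M^1 predictions on the held-out test part.
    level2_val: List[Sample] = [(m1(x), hq) for x, hq in p_test]

    m2 = train(level2_train)
    return m2, level2_train, level2_val


p1 = [(1.0, 2.0), (3.0, 4.0)]
p2 = [(2.0, 4.0), (4.0, 6.0)]
p_test = [(5.0, 6.5)]
m2, level2_train, level2_val = out_of_fold(p1, p2, p_test)
```

The point of the split is that the second-level model $M^2$ only ever sees first-level predictions made on unseen sequences, so its training inputs resemble what it will receive at test time.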