Self-supervised Depth Denoising Using Lower- and Higher-quality RGB-D sensors
Akhmedkhan Shabanov, Ilya Krotov, Nikolay Chinaev, Vsevolod Poletaev, Sergei Kozlukov, Igor Pasechnik, Bulat Yakupov, Artsiom Sanakoyeu, Vadim Lebedev, Dmitry Ulyanov
TL;DR
This work tackles the problem of denoising depth from consumer-level RGB-D sensors by leveraging a higher-quality RGB-D source that records the same scene simultaneously but without time synchronization. The authors develop a self-supervised pipeline that (i) aligns the lower- and higher-quality sensors both temporally and spatially, (ii) trains a UNet-based depth denoiser followed by a ConvLSTM to exploit temporal context, and (iii) uses a simple L1 loss against reprojected HQ depth, restricted by pixel and segmentation masks. The approach yields a substantial improvement over state-of-the-art filtering and data-driven methods (e.g., achieving an average MSE as low as $21.02$ mm in their tests) and enables higher-quality 3D surface reconstruction for both dynamic and static body scenes, demonstrated with DoubleFusion and InfiniTAM/KinectFusion pipelines. Because it needs no precise ground truth, hardware synchronization, or extensive calibration, the method makes dense, accurate depth reconstruction feasible for mobile or embedded depth sensors in real-world settings.
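As an illustration of step (iii), here is a minimal sketch of an L1 loss restricted by two masks. It is not the authors' implementation: the function name `masked_l1_loss`, the argument names, the use of NumPy, and the choice to combine the pixel and segmentation masks with a logical AND are all assumptions made for the example.

```python
import numpy as np


def masked_l1_loss(pred, target, valid_mask, seg_mask):
    """Mean absolute depth error over pixels selected by both masks.

    pred       -- denoised lower-quality depth (H x W), hypothetical name
    target     -- reprojected higher-quality depth (H x W), hypothetical name
    valid_mask -- nonzero where a pixel has usable depth (assumed meaning)
    seg_mask   -- nonzero where a pixel belongs to the object (assumed meaning)
    """
    # Assumption: a pixel contributes only if both masks select it.
    mask = valid_mask.astype(bool) & seg_mask.astype(bool)
    if not mask.any():
        # No supervised pixels in this frame: contribute zero loss.
        return 0.0
    return float(np.abs(pred - target)[mask].mean())


# Toy 2x2 depth maps in millimetres.
pred = np.array([[1000.0, 1010.0], [990.0, 0.0]])
target = np.array([[1002.0, 1000.0], [995.0, 1200.0]])
valid = np.array([[1, 1], [1, 0]])
seg = np.array([[1, 0], [1, 1]])

loss = masked_l1_loss(pred, target, valid, seg)
print(loss)
```

In this toy example only the two pixels in the first column pass both masks, so the loss averages their absolute errors (2 mm and 5 mm). A training loop would compute the same quantity on framework tensors so that gradients can flow back to the denoiser.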
Abstract
Consumer-level depth cameras and depth sensors embedded in mobile devices enable numerous applications, such as AR games and face identification. However, the quality of the captured depth is sometimes insufficient for 3D reconstruction, tracking, and other computer vision tasks. In this paper, we propose a self-supervised approach to denoise and refine depth coming from a low-quality sensor. We record simultaneous RGB-D sequences with unsynchronized lower- and higher-quality cameras and solve the challenging problem of aligning the sequences both temporally and spatially. We then train a deep neural network to denoise the lower-quality depth, using the matched higher-quality data as the source of the supervision signal. We experimentally validate our method against state-of-the-art filtering-based and deep denoising techniques and show its application to 3D object reconstruction tasks, where our approach leads to more detailed fused surfaces and better tracking.
