SLCF-Net: Sequential LiDAR-Camera Fusion for Semantic Scene Completion using a 3D Recurrent U-Net
Helin Cao, Sven Behnke
TL;DR
SLCF-Net addresses semantic scene completion from sequences of RGB images and sparse LiDAR by fusing 2D image features and depth priors into a 3D voxel volume. It introduces Gaussian-decay Depth-prior Projection to back-project 2D features into 3D and a 3D recurrent U-Net to propagate temporal information across frames, coupled with a temporal consistency loss. The method achieves state-of-the-art performance on SemanticKITTI for both scene completion and semantic completion, with strong temporal coherence demonstrated on validation data. This approach enables more reliable outdoor scene understanding for autonomous driving using a practical RGB plus sparse LiDAR setup, while suggesting future work on dynamic objects and scene flow for further improvement.
Abstract
We introduce SLCF-Net, a novel approach for the Semantic Scene Completion (SSC) task that sequentially fuses LiDAR and camera data. It jointly estimates missing geometry and semantics in a scene from sequences of RGB images and sparse LiDAR measurements. The images are semantically segmented by a pre-trained 2D U-Net, and a dense depth prior is estimated by a depth-conditioned pipeline built on Depth Anything. To associate the 2D image features with the 3D scene volume, we introduce Gaussian-decay Depth-prior Projection (GDP). This module projects the 2D features into the 3D volume along the line of sight, weighted by a Gaussian-decay function centered on the depth prior. Volumetric semantics are computed by a 3D U-Net. We propagate the hidden state of the 3D U-Net from frame to frame using the sensor motion and design a novel loss to ensure temporal consistency. We evaluate our approach on the SemanticKITTI dataset and compare it with leading SSC approaches. SLCF-Net excels in all SSC metrics and shows strong temporal consistency.
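The GDP idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a pinhole camera with intrinsics K, voxel centers already expressed in the camera frame, nearest-pixel sampling, and a decay measured along the optical axis with a hand-picked standard deviation sigma. The function name gdp_project and all parameter names are hypothetical.

```python
import numpy as np

def gdp_project(feat, depth, K, voxel_centers, sigma=1.0):
    """Illustrative Gaussian-decay projection (assumed formulation).

    feat:          (H, W, C) 2D image features
    depth:         (H, W) dense depth prior, in the same unit as voxel_centers
    K:             (3, 3) pinhole intrinsics (assumption)
    voxel_centers: (N, 3) voxel centers in the camera frame, z pointing forward
    sigma:         std. dev. of the Gaussian decay (hypothetical parameter)
    Returns an (N, C) array of per-voxel features.
    """
    H, W, C = feat.shape
    out = np.zeros((voxel_centers.shape[0], C), dtype=feat.dtype)

    # Only voxels in front of the camera can receive image features.
    z = voxel_centers[:, 2]
    in_front = z > 1e-6

    # Pinhole projection of the voxel centers to pixel coordinates.
    uvw = voxel_centers[in_front] @ K.T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)

    # Keep voxels whose projection lands inside the image.
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    idx = np.flatnonzero(in_front)[inside]
    u, v = u[inside], v[inside]

    # Gaussian decay centered on the depth prior of the sampled pixel.
    weight = np.exp(-0.5 * ((z[idx] - depth[v, u]) / sigma) ** 2)
    out[idx] = weight[:, None] * feat[v, u]
    return out
```

In this sketch, a voxel lying exactly at the depth prior receives the full image feature, while voxels farther along the same line of sight receive a smoothly attenuated copy. Voxels behind the camera or outside the field of view stay zero.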
