SLCF-Net: Sequential LiDAR-Camera Fusion for Semantic Scene Completion using a 3D Recurrent U-Net

Helin Cao, Sven Behnke

TL;DR

SLCF-Net addresses semantic scene completion from sequences of RGB images and sparse LiDAR by fusing 2D image features and depth priors into a 3D voxel volume. It introduces Gaussian-decay Depth-prior Projection to back-project 2D features into 3D and a 3D recurrent U-Net to propagate temporal information across frames, coupled with a temporal consistency loss. The method achieves state-of-the-art performance on SemanticKITTI for both scene completion and semantic completion, with strong temporal coherence demonstrated on validation data. This approach enables more reliable outdoor scene understanding for autonomous driving using a practical RGB plus sparse LiDAR setup, while suggesting future work on dynamic objects and scene flow for further improvement.

Abstract

We introduce SLCF-Net, a novel approach for the Semantic Scene Completion (SSC) task that sequentially fuses LiDAR and camera data. It jointly estimates missing geometry and semantics in a scene from sequences of RGB images and sparse LiDAR measurements. The images are semantically segmented by a pre-trained 2D U-Net and a dense depth prior is estimated from a depth-conditioned pipeline fueled by Depth Anything. To associate the 2D image features with the 3D scene volume, we introduce Gaussian-decay Depth-prior Projection (GDP). This module projects the 2D features into the 3D volume along the line of sight with a Gaussian-decay function, centered around the depth prior. Volumetric semantics is computed by a 3D U-Net. We propagate the hidden 3D U-Net state using the sensor motion and design a novel loss to ensure temporal consistency. We evaluate our approach on the SemanticKITTI dataset and compare it with leading SSC approaches. The SLCF-Net excels in all SSC metrics and shows great temporal consistency.
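
The Gaussian-decay weighting described above can be illustrated with a minimal sketch. It assumes an unnormalised Gaussian with peak value 1 and a hand-picked standard deviation `sigma`; the exact functional form and its parameters are not given in this summary, and the names `gaussian_decay_weights` and `project_feature` are hypothetical.

```python
import math


def gaussian_decay_weights(sample_depths, depth_prior, sigma=1.0):
    """Weight samples along one line of sight by a Gaussian centred on the depth prior.

    Assumption: unnormalised Gaussian (weight 1.0 exactly at the prior);
    sigma is an illustrative free parameter, not a value from the paper.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return [
        math.exp(-((d - depth_prior) ** 2) / (2.0 * sigma ** 2))
        for d in sample_depths
    ]


def project_feature(feature, sample_depths, depth_prior, sigma=1.0):
    """Distribute one 2D feature vector over the samples of its ray.

    Returns one weighted copy of the feature per sample depth.
    """
    weights = gaussian_decay_weights(sample_depths, depth_prior, sigma)
    return [[w * f for f in feature] for w in weights]


# Samples along a ray (in metres) and a depth prior of 10 m for this pixel.
depths = [8.0, 9.0, 10.0, 11.0, 12.0]
weights = gaussian_decay_weights(depths, depth_prior=10.0, sigma=1.0)
projected = project_feature([0.5, 2.0], depths, depth_prior=10.0, sigma=1.0)
```

Samples close to the depth prior receive most of the feature, while samples farther along the ray still receive a small share, so the projection tolerates errors in the prior.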

Paper Structure (21 sections, 6 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: SLCF-Net estimates the dense semantic scene shown in (c) from sequences of RGB images (a) and aligned sparse LiDAR depth maps (b). Both (c) and (d) depict a voxelized scene, as defined by the SemanticKITTI benchmark (Behley et al., 2019), from a bird's-eye view. Parts of the scene in both the estimate (c) and the ground truth (d) lie outside the field of view (FoV) and are shown as shaded areas. The unknown areas, as defined by the ground truth, are visualized at $20\%$ opacity in (c).
  • Figure 2: Overall pipeline of SLCF-Net. The input is a sequence of RGB images, each paired with a sparse depth map projected from a single-sweep point cloud. Processing begins by extracting image features in two channels: 2D semantic features are extracted by an EfficientNet (Tan and Le, 2019) with noisy student training (Xie et al., 2020), while relative depth is estimated by the Depth Anything model. The relative depth is then scaled using the sparse depth input to produce a depth prior for the entire image. Next, the Gaussian-decay Depth-prior Projection (GDP) module back-projects the 2D features onto a predefined 3D volume, distributing them according to the depth priors. The 3D features are fed into a 3D recurrent U-Net, which draws on information from the previous frame. Finally, the network outputs a dense semantic voxel grid of the scene.
  • Figure 3: Gaussian-decay Depth-prior Projection (GDP). (a) A 2D feature located at pixel coordinate $\bm{p}$ is projected to voxels in the 3D volume along the line of sight. Given the depth prior $\hat{d}$, $\bm{P}$ is considered the most probable point and serves as the center of the Gaussian-decay function. (b) Weight of the Gaussian-decay function.
  • Figure 4: Concept of temporal feature propagation. For consecutive frames $C_{t-1}$ and $C_t$, the blue and red cubes represent the respective defined volumes. The previous volume $V_{t-1}$ is aligned to the current volume $V_{t}$ via a coordinate transformation. The overlap region, denoted $V_{overlap}$ and visualized in purple, is estimated in both neighboring frames, so the two estimates should be consistent. Features located at the same global position are then concatenated to propagate information across frames.
  • Figure 5: Qualitative results on SemanticKITTI. The 19 classes are shown from a bird's-eye view, with empty space omitted. Estimated voxels that lie in the unknown region are visualized at $20\%$ opacity. The region outside the FoV is shaded.
  • ...and 1 more figure
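
The Figure 2 caption says the relative depth is scaled using the sparse depth input, but does not say how. One common way to do this, shown here purely as an assumption, is a least-squares fit of a scale and a shift at the pixels that have a LiDAR return, applied afterwards to every pixel. The sketch treats the image as a flat list, marks pixels without a LiDAR return as `None`, and assumes the relative values are affinely related to metric depth; the function names are hypothetical.

```python
def fit_scale_shift(relative, sparse):
    """Least-squares fit of sparse ~= scale * relative + shift.

    relative: relative depth per pixel (flat list of floats).
    sparse:   metric LiDAR depth per pixel, or None where no return exists.
    Assumption: an affine relation between relative and metric depth.
    """
    pairs = [(r, s) for r, s in zip(relative, sparse) if s is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two pixels with LiDAR depth")
    mean_r = sum(r for r, _ in pairs) / n
    mean_s = sum(s for _, s in pairs) / n
    var_r = sum((r - mean_r) ** 2 for r, _ in pairs)
    if var_r == 0:
        raise ValueError("relative depth is constant at the LiDAR pixels")
    cov = sum((r - mean_r) * (s - mean_s) for r, s in pairs)
    scale = cov / var_r
    shift = mean_s - scale * mean_r
    return scale, shift


def dense_depth_prior(relative, sparse):
    """Apply the fitted scale and shift to every pixel of the relative depth."""
    scale, shift = fit_scale_shift(relative, sparse)
    return [scale * r + shift for r in relative]


# Four pixels, two of which have a LiDAR measurement (in metres).
relative = [0.1, 0.2, 0.3, 0.4]
sparse = [None, 5.0, None, 9.0]
scale, shift = fit_scale_shift(relative, sparse)
prior = dense_depth_prior(relative, sparse)
```

With only two LiDAR pixels the fit passes exactly through both measurements; with a full sweep it averages out noise in individual returns.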