Multi-modal NeRF Self-Supervision for LiDAR Semantic Segmentation

Xavier Timoneda, Markus Herb, Fabian Duerr, Daniel Goehring, Fisher Yu

TL;DR

A Semi-Supervised Learning setup that leverages unlabeled LiDAR pointclouds alongside knowledge distilled from camera images, shown to be effective on three public LiDAR Semantic Segmentation benchmarks: nuScenes, SemanticKITTI and ScribbleKITTI.

Abstract

LiDAR Semantic Segmentation is a fundamental task in autonomous driving perception consisting of associating each LiDAR point to a semantic label. Fully-supervised models have widely tackled this task, but they require labels for each scan, which either limits their domain or requires impractical amounts of expensive annotations. Camera images, which are generally recorded alongside LiDAR pointclouds, can be processed by the widely available 2D foundation models, which are generic and dataset-agnostic. However, distilling knowledge from 2D data to improve LiDAR perception raises domain adaptation challenges. For example, the classical perspective projection suffers from the parallax effect produced by the position shift between both sensors at their respective capture times. We propose a Semi-Supervised Learning setup to leverage unlabeled LiDAR pointclouds alongside distilled knowledge from the camera images. To self-supervise our model on the unlabeled scans, we add an auxiliary NeRF head and cast rays from the camera viewpoint over the unlabeled voxel features. The NeRF head predicts densities and semantic logits at each sampled ray location which are used for rendering pixel semantics. Concurrently, we query the Segment-Anything (SAM) foundation model with the camera image to generate a set of unlabeled generic masks. We fuse the masks with the rendered pixel semantics from LiDAR to produce pseudo-labels that supervise the pixel predictions. During inference, we drop the NeRF head and run our model with only LiDAR. We show the effectiveness of our approach in three public LiDAR Semantic Segmentation benchmarks: nuScenes, SemanticKITTI and ScribbleKITTI.
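
The rendering step described in the abstract (densities and semantic logits predicted at sampled ray locations, then integrated into pixel semantics) can be illustrated with a minimal sketch for a single ray. It assumes the standard NeRF alpha-compositing weights and applies the softmax after integration; neither detail is confirmed by the text above. The function name `render_ray_semantics` and the `deltas` argument (spacing between samples) are hypothetical.

```python
import math

def render_ray_semantics(densities, logits, deltas):
    """Composite per-sample semantic logits along one ray into class probabilities.

    densities: M non-negative densities, one per sample on the ray
    logits:    M lists of K semantic logits, one list per sample
    deltas:    M distances between consecutive samples (assumed to be known)
    """
    num_classes = len(logits[0])
    rendered = [0.0] * num_classes
    transmittance = 1.0  # fraction of the ray that reaches the current sample
    for sigma, sample_logits, delta in zip(densities, logits, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this sample
        weight = transmittance * alpha          # contribution of this sample
        for k in range(num_classes):
            rendered[k] += weight * sample_logits[k]
        transmittance *= 1.0 - alpha
    # Softmax over the rendered logits (assumption: applied after integration).
    peak = max(rendered)
    exps = [math.exp(v - peak) for v in rendered]
    total = sum(exps)
    return [e / total for e in exps]

# Three samples, two classes: empty space, a dense surface voting for class 0,
# then a faint sample behind it voting for class 1.
probs = render_ray_semantics(
    densities=[0.0, 5.0, 0.1],
    logits=[[0.0, 0.0], [4.0, -4.0], [-4.0, 4.0]],
    deltas=[1.0, 1.0, 1.0],
)
```

In this toy ray the dense second sample absorbs almost all of the transmittance, so the rendered pixel is dominated by class 0 and the sample behind it contributes very little.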

Paper Structure

This paper contains 5 sections, 5 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Pixel pseudo-label generation. Our Semi-Supervised setup leverages unlabeled multi-modal data by casting rays from the camera viewpoint into the unlabeled LiDAR features to render pixel semantic predictions. The renderings are fused with unlabeled generic masks from the SAM foundation model to produce confident, refined pixel pseudo-labels.
  • Figure 2: Overview of our method. During training, labeled $x^{l}$ and unlabeled $x^{u}$ scans are processed in parallel by a 3D U-Net. The voxel features $\mathcal{U}^{l}$ of the labeled scan are processed by both $vox_{3D}$ and $NeRF$ heads to get point-wise semantic predictions. These are compared against the 3D Ground-Truth $y^{l}$ to form the supervised 3D losses $\mathcal{L}_{3D_{vox}}^{l}$ and $\mathcal{L}_{3D_{NeRF}}^{l}$. Concurrently, for each unlabeled image $z^{u}$ from $x^{u}$, we obtain generic masks $\mathcal{S}^{u}$ with the ${F}_{SAM}$ foundation model. We trace $P$ rays from the camera origin with each pixel's direction, and sample the unlabeled voxel features $\mathcal{U}^{u}$ at $M$ locations along each ray. These are processed by the $NeRF$ head to predict $P \times M$ semantic logits $\hat{l}_{m}$ and densities $\hat{\sigma}_{m}$. We integrate $\hat{l}_{m}$ and $\hat{\sigma}_{m}$ along the ray to render the per-pixel class probabilities $\hat{y}_p$. The confidence sampler merges $\hat{y}_p$ with the masks $\mathcal{S}^{u}$ to get refined pseudo-labels $\hat{\mathcal{C}}_{s}$. The predictions $\hat{y}_p$ are compared against the pseudo-labels $\hat{\mathcal{C}}_{s}$ to form the self-supervised 2D loss $\mathcal{L}_{2D_{NeRF}}^{u}$. At inference time, we remove all components represented in yellow, resulting in LiDAR-only inference.
  • Figure 3: Entropy of the pixel renderings. In this context, we use the entropy $\mathcal{H}$ as a measure of the uncertainty in the predicted class probabilities of each rendered pixel. Bright colors denote high entropy. We observe the highest entropy $\mathcal{H}$ concentrates at object boundaries.
  • Figure 4: Qualitative analysis. For each example of SemanticKITTI, we show the rendered pixel semantics $\hat{y}_{p}$ from the unlabeled voxel features and the generated pseudo-labels $\hat{\mathcal{C}}_{s}$ that supervise the pixel predictions.
  • Figure 5: Qualitative results. Error maps from LiDAR bird's eye view on the 1% split of nuScenes. The first row shows the Ground-Truth labels of each example. The second and third rows show the correct and incorrect predictions, painted in blue and orange, for the Sup.-Only and Ours models, respectively. The red boxes highlight the regions with notable differences.
  • ...and 1 more figure
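
The captions for Figures 2 and 3 mention per-pixel entropy and a confidence sampler that merges rendered probabilities with class-agnostic masks. The sketch below illustrates one way such a fusion could work: each mask takes the majority class of its low-entropy pixels. The majority vote, the `max_entropy` threshold, and the function names are assumptions for illustration only; the captions do not specify the sampler's actual rule.

```python
import math
from collections import Counter

def entropy(probs):
    """Shannon entropy (in nats) of one pixel's class probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mask_pseudo_labels(pixel_probs, masks, max_entropy=0.5):
    """Assign one class per mask by majority vote over its confident pixels.

    pixel_probs: dict mapping pixel id -> list of K rendered class probabilities
    masks:       list of lists of pixel ids (class-agnostic masks, e.g. from SAM)
    max_entropy: hypothetical confidence threshold, not taken from the paper
    Returns a dict pixel id -> class index for pixels in masks that received a vote.
    """
    labels = {}
    for mask in masks:
        votes = Counter()
        for pixel in mask:
            probs = pixel_probs[pixel]
            if entropy(probs) <= max_entropy:  # only confident pixels vote
                votes[probs.index(max(probs))] += 1
        if votes:  # masks without any confident pixel stay unlabeled
            winner = votes.most_common(1)[0][0]
            for pixel in mask:
                labels[pixel] = winner
    return labels

# Pixel 2 is an uncertain boundary pixel; it inherits the label of its mask.
pixel_probs = {0: [0.95, 0.05], 1: [0.9, 0.1], 2: [0.5, 0.5], 3: [0.55, 0.45]}
masks = [[0, 1, 2], [3]]
labels = mask_pseudo_labels(pixel_probs, masks)
```

Here the first mask is labeled class 0 for all three of its pixels, including the high-entropy one, while the second mask contains no confident pixel and is left without a pseudo-label. This mirrors the observation in Figure 3 that uncertainty concentrates at object boundaries, which is where mask shapes can help.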