OccFeat: Self-supervised Occupancy Feature Prediction for Pretraining BEV Segmentation Networks

Sophia Sirko-Galouchenko, Alexandre Boulch, Spyros Gidaris, Andrei Bursuc, Antonin Vobecky, Patrick Pérez, Renaud Marlet

TL;DR

A self-supervised pretraining method for camera-only Bird's-Eye-View (BEV) segmentation networks that combines occupancy prediction with feature distillation. Empirical results show that integrating feature distillation with 3D occupancy prediction is effective in this pretraining approach.

Abstract

We introduce a self-supervised pretraining method, called OccFeat, for camera-only Bird's-Eye-View (BEV) segmentation networks. With OccFeat, we pretrain a BEV network via occupancy prediction and feature distillation tasks. Occupancy prediction provides a 3D geometric understanding of the scene to the model. However, the geometry learned is class-agnostic. Hence, we add semantic information to the model in the 3D space through distillation from a self-supervised pretrained image foundation model. Models pretrained with our method exhibit improved BEV semantic segmentation performance, particularly in low-data scenarios. Moreover, empirical results affirm the efficacy of integrating feature distillation with 3D occupancy prediction in our pretraining approach. Repository: https://github.com/valeoai/Occfeat
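
The two pretraining objectives named in the abstract can be illustrated with a minimal NumPy sketch. It is not taken from the repository: the loss forms (binary cross-entropy for occupancy, cosine distance for distillation), the function names, the tensor shapes, and the `weight` parameter are all assumptions made for illustration. The distillation term is evaluated only on occupied voxels, which matches the "occupancy-guided" description.

```python
import numpy as np

def occupancy_loss(occ_logits, occ_target):
    """Binary cross-entropy over all voxels (assumed loss form).

    occ_logits, occ_target: arrays of shape (X, Y, Z); target values are 0 or 1.
    """
    # Numerically stable BCE-with-logits formulation.
    loss = (np.maximum(occ_logits, 0.0)
            - occ_logits * occ_target
            + np.log1p(np.exp(-np.abs(occ_logits))))
    return float(loss.mean())

def distillation_loss(pred_feats, target_feats, occ_target, eps=1e-8):
    """Mean cosine distance between predicted and target features,
    computed on occupied voxels only (assumed loss form).

    pred_feats, target_feats: arrays of shape (X, Y, Z, C).
    """
    mask = occ_target.astype(bool)
    if not mask.any():
        return 0.0
    p = pred_feats[mask]      # (N, C) features of occupied voxels
    t = target_feats[mask]
    p = p / (np.linalg.norm(p, axis=-1, keepdims=True) + eps)
    t = t / (np.linalg.norm(t, axis=-1, keepdims=True) + eps)
    return float((1.0 - (p * t).sum(axis=-1)).mean())

def pretraining_loss(occ_logits, pred_feats, occ_target, target_feats, weight=1.0):
    """Sum of both terms; `weight` is a hypothetical balancing factor."""
    return (occupancy_loss(occ_logits, occ_target)
            + weight * distillation_loss(pred_feats, target_feats, occ_target))
```

In this sketch, `target_feats` would hold image foundation model features assigned to each voxel; how those features are lifted from images to 3D is not covered here.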

Paper Structure (15 sections, 3 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Performance comparison in the low-data regime (1% annotated data of nuScenes)
  • Figure 2: Overview of OccFeat's self-supervised BEV pretraining approach. OccFeat attaches an auxiliary pretraining head on top of the BEV network. This head "unsplats" the BEV features to a 3D feature volume and uses it to predict (a) the 3D occupancy of the scene (occupancy reconstruction loss) and (b) high-level self-supervised image features characterizing the occupied voxels (occupancy-guided distillation loss). The occupancy targets are produced by "voxelizing" Lidar points (see Figure 3), while the self-supervised image foundation model DINOv2 provides the feature targets for the occupied voxels. The pretraining head is removed after pretraining.
  • Figure 3: Occupancy grid. A voxel is considered occupied if there is at least one point inside it.
  • Figure 4: Study on robustness. Segmentation results on the nuScenes-C dataset for Vehicle classes using the BEVFormer network with an EN-B0 image backbone on 100% annotated data. Comparison of our OccFeat against no BEV pretraining.
  • Figure 5: Visualisation of predicted 3D features, using a 3-dimensional PCA mapped on RGB channels. The features contain semantic information, e.g., cars in cyan color. Using the same PCA mapping on a different scene (right), we show that semantic features are consistent across scenes. Objects within colored circles in the feature space correspond to those in similarly colored circles in the image.
  • ...and 1 more figures
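
The occupancy rule in the Figure 3 caption (a voxel is occupied if at least one point falls inside it) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `voxelize` and the parameters `grid_min`, `voxel_size`, and `grid_shape` are hypothetical, and the grid is assumed to be axis-aligned with cubic voxels.

```python
import numpy as np

def voxelize(points, grid_min, voxel_size, grid_shape):
    """Return a boolean occupancy grid from a Lidar point cloud.

    points:     (N, 3) array of x, y, z coordinates.
    grid_min:   (3,) coordinates of the grid's minimum corner (assumed).
    voxel_size: edge length of one cubic voxel (assumed).
    grid_shape: (X, Y, Z) number of voxels along each axis (assumed).
    """
    points = np.asarray(points, dtype=float).reshape(-1, 3)
    grid_min = np.asarray(grid_min, dtype=float)
    shape = np.asarray(grid_shape)

    # Integer voxel index of each point.
    idx = np.floor((points - grid_min) / voxel_size).astype(int)

    # Discard points that fall outside the grid.
    inside = np.all((idx >= 0) & (idx < shape), axis=1)
    idx = idx[inside]

    # A voxel is occupied as soon as one point lands in it.
    occ = np.zeros(tuple(grid_shape), dtype=bool)
    occ[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return occ
```

For example, `voxelize([[0.1, 0.1, 0.1], [1.5, 0.5, 0.5]], (0, 0, 0), 1.0, (2, 2, 2))` marks voxels `(0, 0, 0)` and `(1, 0, 0)` as occupied and leaves the rest empty.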