WildOcc: A Benchmark for Off-Road 3D Semantic Occupancy Prediction

Heng Zhai, Jilin Mei, Chen Min, Liang Chen, Fangzhou Zhao, Yu Hu

TL;DR

This paper introduces WildOcc, to the authors' knowledge the first benchmark to provide dense occupancy annotations for off-road 3D semantic occupancy prediction, together with a multi-modal 3D semantic occupancy prediction framework that fuses spatio-temporal information from multi-frame images and point clouds at the voxel level.

Abstract

3D semantic occupancy prediction is an essential part of autonomous driving, focusing on capturing the geometric details of scenes. Off-road environments are rich in geometric information, so 3D semantic occupancy prediction is well suited to reconstructing such scenes. However, most research concentrates on on-road environments, and few methods are designed for off-road 3D semantic occupancy prediction because relevant datasets and benchmarks are lacking. In response to this gap, we introduce WildOcc, to our knowledge the first benchmark to provide dense occupancy annotations for off-road 3D semantic occupancy prediction tasks. We propose a ground truth generation pipeline that employs coarse-to-fine reconstruction to achieve more realistic results. Moreover, we introduce a multi-modal 3D semantic occupancy prediction framework, which fuses spatio-temporal information from multi-frame images and point clouds at the voxel level. In addition, a cross-modality distillation function is introduced, which transfers geometric knowledge from point clouds to image features.
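
To make "fusion at the voxel level" concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that camera and LiDAR features have already been lifted onto the same voxel grid, and it fuses them by channel concatenation followed by a per-voxel linear projection and ReLU. The function name `fuse_voxel_features`, the tensor layout `(C, X, Y, Z)`, and the concatenate-then-project operator are all illustrative assumptions; the abstract does not specify the fusion operator.

```python
import numpy as np


def fuse_voxel_features(cam_feat, lidar_feat, weight, bias):
    """Fuse two voxel feature grids that share the same spatial shape.

    cam_feat:   (C_cam, X, Y, Z) camera features lifted to voxels (assumed layout)
    lidar_feat: (C_lidar, X, Y, Z) LiDAR features on the same grid (assumed layout)
    weight:     (C_out, C_cam + C_lidar) per-voxel linear projection (hypothetical)
    bias:       (C_out,)
    """
    if cam_feat.shape[1:] != lidar_feat.shape[1:]:
        raise ValueError("camera and LiDAR grids must share spatial dimensions")
    # Stack the two modalities along the channel axis, voxel by voxel.
    stacked = np.concatenate([cam_feat, lidar_feat], axis=0)
    # Apply the same linear map to the channel vector of every voxel.
    fused = np.einsum("oc,cxyz->oxyz", weight, stacked)
    fused = fused + bias[:, None, None, None]
    # ReLU non-linearity.
    return np.maximum(fused, 0.0)


# Toy usage on a 4 x 4 x 2 grid with random features.
rng = np.random.default_rng(0)
cam = rng.normal(size=(8, 4, 4, 2))
lidar = rng.normal(size=(16, 4, 4, 2))
w = rng.normal(size=(12, 24))
b = np.zeros(12)
fused = fuse_voxel_features(cam, lidar, w, b)  # shape (12, 4, 4, 2)
```

Because the projection acts on each voxel independently, the fused grid keeps the spatial resolution of its inputs, which is what allows a per-voxel semantic label to be predicted from it.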

Paper Structure

This paper contains 29 sections, 9 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: WildOcc provides dense semantic occupancy annotations for a 10,000-frame off-road dataset. Examples of large-scale annotations are shown from a top view. (Better viewed when zoomed in.)
  • Figure 2: An example of results from different pipelines. (a) Image view of the scene. (b) Point clouds after multi-frame aggregation. (c) Annotations from the SurroundOcc pipeline [wei2023surroundocc], without coarse-to-fine reconstruction. (d) Annotations from the WildOcc pipeline (ours), with coarse-to-fine reconstruction. The regions highlighted in blue, yellow and red show that coarse-to-fine reconstruction brings the annotations of off-road environments closer to the real scene.
  • Figure 3: The overall architecture of the OffOcc framework. It consists of camera, LiDAR and multi-modal branches. To utilize information from historical frames, we design a spatio-temporal alignment module that combines it. When training the camera branch, we use the LiDAR branch as the teacher and the camera branch as the student, to transfer geometric knowledge from the LiDAR branch.
  • Figure 4: Qualitative results of OffOcc and M-CONet on the WildOcc dataset. The input monocular image and LiDAR sweeps are shown on the left. (The LiDAR sweeps are colored only for visualization; the actual LiDAR input does not contain semantic information. Better viewed when zoomed in.)
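
The teacher-student setup described in the Figure 3 caption can be illustrated with a small NumPy sketch. It is a hedged example under stated assumptions, not the paper's distillation function: it assumes both branches produce voxel features of shape `(C, X, Y, Z)` and uses a plain mean squared error between them, optionally restricted to a boolean mask of voxels. The function name `distillation_loss` and the optional `voxel_mask` argument are hypothetical.

```python
import numpy as np


def distillation_loss(student_feat, teacher_feat, voxel_mask=None):
    """Mean squared error between camera (student) and LiDAR (teacher) voxel features.

    student_feat, teacher_feat: arrays of shape (C, X, Y, Z) (assumed layout)
    voxel_mask: optional boolean array of shape (X, Y, Z); when given, only the
                selected voxels contribute (hypothetical option, for example
                voxels that contain LiDAR points)
    """
    student = np.asarray(student_feat, dtype=np.float64)
    # In a training framework the teacher would be held fixed (no gradient);
    # NumPy has no gradients, so it is simply treated as a constant target here.
    teacher = np.asarray(teacher_feat, dtype=np.float64)
    if student.shape != teacher.shape:
        raise ValueError("student and teacher features must have the same shape")
    # Average the squared error over channels to get one value per voxel.
    per_voxel = ((student - teacher) ** 2).mean(axis=0)
    if voxel_mask is None:
        return float(per_voxel.mean())
    mask = np.asarray(voxel_mask, dtype=bool)
    if not mask.any():
        return 0.0
    return float(per_voxel[mask].mean())


# Toy usage: the loss is zero only when the student matches the teacher.
teacher = np.ones((4, 2, 2, 2))
student = np.zeros((4, 2, 2, 2))
loss_far = distillation_loss(student, teacher)
loss_same = distillation_loss(teacher, teacher)
```

Minimizing such a term alongside the occupancy prediction loss pulls the camera features toward the LiDAR features, which is one simple way geometric knowledge could be transferred from the teacher branch to the student branch.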