A Simple Framework for 3D Occupancy Estimation in Autonomous Driving

Wanshui Gan, Ningkai Mo, Hongbin Xu, Naoto Yokoya

TL;DR

This work introduces SimpleOccupancy, a CNN-based framework that estimates 3D occupancy from surrounding-view images for autonomous driving. It uses a parameter-free 2D-to-3D unprojection followed by 3D CNNs to produce voxel-wise occupancy probabilities, with an optional SDF-based reconstruction pathway. Training combines supervised depth-based losses with self-supervised photometric cues. The authors propose a discrete depth metric for occupancy evaluation and benchmark the approach on DDAD and nuScenes, reporting competitive depth estimation alongside effective 3D occupancy reconstruction. They also explore a point-level pretraining strategy and discuss connections to monocular depth estimation and semantic occupancy, resulting in a practical framework for 3D perception that can leverage unlabeled data for training.
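
As an illustration of the parameter-free 2D-to-3D unprojection mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes a single pinhole camera, voxel centers already expressed in that camera's coordinate frame, and a single-channel feature map. The names `bilinear_sample` and `unproject` and the parameters `fx`, `fy`, `cx`, `cy` are hypothetical and chosen only for this example.

```python
from math import floor


def bilinear_sample(feat, u, v):
    """Bilinearly interpolate feat[row][col] at pixel (u, v).

    Returns None when (u, v) falls outside the feature map.
    """
    h, w = len(feat), len(feat[0])
    if not (0.0 <= u <= w - 1 and 0.0 <= v <= h - 1):
        return None
    x0, y0 = int(floor(u)), int(floor(v))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = u - x0, v - y0
    top = (1 - ax) * feat[y0][x0] + ax * feat[y0][x1]
    bottom = (1 - ax) * feat[y1][x0] + ax * feat[y1][x1]
    return (1 - ay) * top + ay * bottom


def unproject(feat, voxel_centers_cam, fx, fy, cx, cy):
    """Fill each voxel with the image feature found at its pinhole projection.

    No learnable parameters are involved: only camera geometry and interpolation.
    Voxels behind the camera or outside the image receive 0.0 (an assumption).
    """
    volume = []
    for x, y, z in voxel_centers_cam:
        if z <= 0:  # behind the camera
            volume.append(0.0)
            continue
        u = fx * x / z + cx  # pinhole projection to pixel coordinates
        v = fy * y / z + cy
        sample = bilinear_sample(feat, u, v)
        volume.append(0.0 if sample is None else sample)
    return volume


feat = [[0.0, 1.0],
        [2.0, 3.0]]
voxels = [(0.0, 0.0, 1.0), (0.0, 0.0, 2.0), (0.0, 0.0, -1.0), (10.0, 0.0, 1.0)]
volume = unproject(feat, voxels, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

In this toy example the first two voxels lie on the same camera ray and therefore receive the same feature value. This is why an unprojection of this kind alone cannot resolve depth, and a subsequent stage (the 3D CNN in the framework described here) has to aggregate the volume.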

Abstract

Estimating 3D occupancy from surrounding-view images is an exciting development in autonomous driving, following the success of Bird's Eye View (BEV) perception. The task provides crucial 3D attributes of the driving environment and improves the overall understanding of the surrounding space. In this work, we present a simple CNN-based framework for 3D occupancy estimation, designed to reveal several key factors for the task, such as network design, optimization, and evaluation. In addition, we explore the relationship between 3D occupancy estimation and related tasks, such as monocular depth estimation and 3D reconstruction, which could advance the study of 3D perception in autonomous driving. For evaluation, we propose a simple sampling strategy to define the metric for occupancy evaluation, which is flexible enough for current public datasets. Moreover, we establish a benchmark in terms of the depth estimation metric, comparing our method with monocular depth estimation methods on the DDAD and nuScenes datasets and achieving competitive performance. The relevant code will be updated at https://github.com/GANWANSHUI/SimpleOccupancy.

Paper Structure (24 sections, 9 equations, 15 figures, 5 tables)

Figures (15)

  • Figure 1: A comparison of the overall pipelines of monocular depth estimation, stereo matching, and 3D occupancy estimation.
  • Figure 2: Overview of the proposed Simple 3D Occupancy estimation framework (SimpleOccupancy). Given the surrounding images, we first extract image features with a shared 2D CNN and then use parameter-free interpolation to obtain the 3D volume. The following 3D CNN aggregates the 3D features in the volume space (Section 3.2). Finally, we train the proposed network with both supervised learning from the sparse point cloud and self-supervised learning with a photometric consistency loss (Section 3.4).
  • Figure 3: We use a collection of key points to represent the 3D space and evaluate it based on sample points. The classification label is shown on the upper side, while the bottom side compares two metrics, the classification metric and our newly proposed discrete depth metric, in two different prediction cases. Note that the unknown point is not involved in the classification metric but is involved in the proposed discrete depth metric. The discrete depth metric accurately reflects the cost associated with each prediction.
  • Figure 4: Visualization of the 3D occupancy and depth estimation ablation study of the proposed method (DDAD dataset). The first row: the surrounding images and the rendered depth maps. The second row: based on the occupancy label, we present the binary prediction, where red, green, and white denote false negatives, true positives, and false positives, respectively. The third row: the dense occupancy prediction in the voxel grid, where a darker color means the occupancy is closer to the ego vehicle. The BCE loss, L1 loss, Depth loss, and Full model correspond to experiment settings (2), (3), (1), and (6) in the 3D occupancy and depth metric tables, respectively. Note that we omit predictions below 0.4 m for better visualization. Best viewed in color.
  • Figure 5: Depth map and mesh visualization comparing the density and signed distance function (SDF) representations under the self-supervised learning setting on the nuScenes dataset. The mesh for density is extracted with a threshold of 0.5. We visualize the mesh from the camera view with a height range from 0 to 5 m. The density representation extracts a reasonable mesh for scene 1 but does not work well for scene 2 at the same threshold. With SDF, we obtain a good mesh prediction for both scenes 1 and 2.
  • ...and 10 more figures
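
The idea behind the discrete depth metric described in the Figure 3 caption can be sketched as follows. This is one illustrative reading under stated assumptions, not the paper's exact definition: points are sampled along a ray at a fixed spacing `step`, the depth of the first sample whose occupancy probability exceeds `threshold` is taken as the predicted depth, and the result is scored with the absolute relative error commonly used in depth estimation. The function names `discrete_depth` and `abs_rel` and all parameter values are hypothetical.

```python
def discrete_depth(occupancy_probs, step, threshold=0.5):
    """Depth of the first sample along a ray predicted as occupied.

    occupancy_probs[i] is the predicted probability at distance (i + 1) * step.
    If no sample exceeds the threshold, the farthest sampled depth is returned
    (an assumption made for this sketch).
    """
    for i, prob in enumerate(occupancy_probs):
        if prob > threshold:
            return (i + 1) * step
    return len(occupancy_probs) * step


def abs_rel(pred_depths, gt_depths):
    """Mean absolute relative depth error over rays with valid ground truth."""
    pairs = [(p, g) for p, g in zip(pred_depths, gt_depths) if g > 0]
    if not pairs:
        raise ValueError("no rays with valid ground-truth depth")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


# Ground-truth surface at 2.0 m; samples every 0.5 m along one ray.
gt = [2.0]
near_miss = discrete_depth([0.1, 0.2, 0.9, 0.8], step=0.5)  # first hit at 1.5 m
far_miss = discrete_depth([0.9, 0.2, 0.1, 0.8], step=0.5)   # first hit at 0.5 m
error_near = abs_rel([near_miss], gt)
error_far = abs_rel([far_miss], gt)
```

Both predictions place the first occupied sample in front of the true surface, but the depth-based score penalizes the far miss more heavily than the near miss, which is the kind of per-prediction cost the caption refers to.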