OPUS: Occupancy Prediction Using a Sparse Set

Jiabao Wang, Zhaojiang Liu, Qiang Meng, Liujiang Yan, Ke Wang, Jie Yang, Wei Liu, Qibin Hou, Ming-Ming Cheng

TL;DR

OPUS recasts occupancy prediction as a streamlined set prediction problem, with no need for explicit space modeling or complex sparsification procedures. The framework uses a transformer encoder-decoder architecture to predict occupied locations and their classes simultaneously from a set of learnable queries.

Abstract

Occupancy prediction, which aims to predict the occupancy status within a voxelized 3D environment, is quickly gaining momentum within the autonomous driving community. Mainstream occupancy prediction works first discretize the 3D environment into voxels, then perform classification on the resulting dense grids. However, inspection of sample data reveals that the vast majority of voxels are unoccupied. Classifying these empty voxels is a suboptimal allocation of computational resources, and reducing the number of empty voxels necessitates complex algorithm designs. To this end, we present a novel perspective on the occupancy prediction task: formulating it as a streamlined set prediction paradigm without the need for explicit space modeling or complex sparsification procedures. Our proposed framework, called OPUS, utilizes a transformer encoder-decoder architecture to simultaneously predict occupied locations and classes using a set of learnable queries. First, we employ the Chamfer distance loss to scale the set-to-set comparison problem to unprecedented magnitudes, making end-to-end training of such a model a reality. Subsequently, semantic classes are adaptively assigned using nearest neighbor search based on the learned locations. In addition, OPUS incorporates a suite of non-trivial strategies to enhance model performance, including coarse-to-fine learning, consistent point sampling, and adaptive re-weighting. Finally, compared with current state-of-the-art methods, our lightest model achieves superior RayIoU on the Occ3D-nuScenes dataset at nearly 2x the FPS, while our heaviest model surpasses the previous best results by 6.1 RayIoU.
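
For readers unfamiliar with the Chamfer distance mentioned in the abstract, one common symmetric form is written out below. It borrows the notation of the Figure 1 caption ($\mathbb{P}$ for predicted points, $\mathbb{P}_g$ for ground-truth occupied voxel positions). The point-to-point distance $d$ and the normalization by set size are assumptions made here for illustration; the paper's exact choice of norm and weighting may differ.

$$
\mathrm{CD}(\mathbb{P}, \mathbb{P}_g) = \frac{1}{|\mathbb{P}|} \sum_{p \in \mathbb{P}} \min_{p_g \in \mathbb{P}_g} d(p, p_g) + \frac{1}{|\mathbb{P}_g|} \sum_{p_g \in \mathbb{P}_g} \min_{p \in \mathbb{P}} d(p, p_g)
$$

Each term needs only a nearest-neighbor lookup per point rather than a one-to-one assignment between the two sets, which is plausibly why this loss remains tractable for very large point sets.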

Paper Structure (25 sections, 6 equations, 10 figures, 9 tables)

This paper contains 25 sections, 6 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Occupancy prediction is approached as a set prediction problem. For each scene, we predict a set of point positions $\mathbb{P}$ and a set of corresponding semantic classes $\mathbb{C}$. Given the ground-truth set of occupied voxel positions $\mathbb{P}_g$ and classes $\mathbb{C}_g$, we decouple the set-to-set matching task into two distinct components: (a) Enforcing similarity between the point distributions of $\mathbb{P}$ and $\mathbb{P}_g$ using the Chamfer distance. (b) Aligning the predicted classes $\mathbb{C}$ with the target classes $\hat{\mathbb{C}} = \Phi(\mathbb{P}, \mathbb{P}_g, \mathbb{C}_g)$, where $\Phi$ assigns each point in $\mathbb{P}$ the class of its nearest ground-truth point.
  • Figure 2: OPUS leverages a transformer encoder-decoder architecture comprising: (1) An image encoder to extract 2D features from multi-view images. (2) A series of decoders to refine the queries with image features, which are correlated via the consistent point sampling module. (3) A set of learnable queries to predict locations and classes of occupancy points. Each query follows a coarse-to-fine rule, progressively increasing the number of predicted points. Finally, the entire model is trained end-to-end using our adaptively re-weighted set-to-set losses.
  • Figure 3: Visualizations of occupancy predictions. Best viewed in color.
  • Figure 4: Visualizations of the coarse-to-fine predictions.
  • Figure 5: Distributions of standard deviations of points from one query.
  • ...and 5 more figures
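
The decoupled matching described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`nearest_index`, `chamfer_distance`, `assign_classes`), the Euclidean distance, the mean normalization, and the brute-force search are all assumptions chosen for readability, and only the Python standard library is used.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def nearest_index(p: Point, candidates: Sequence[Point]) -> int:
    """Return the index of the candidate closest to p (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda i: math.dist(p, candidates[i]))


def chamfer_distance(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Component (a): mean nearest-neighbor distance, summed over both directions.

    Assumption: unsquared Euclidean distance with per-set averaging.
    """
    pred_to_gt = sum(math.dist(p, gt[nearest_index(p, gt)]) for p in pred) / len(pred)
    gt_to_pred = sum(math.dist(g, pred[nearest_index(g, pred)]) for g in gt) / len(gt)
    return pred_to_gt + gt_to_pred


def assign_classes(
    pred: Sequence[Point], gt: Sequence[Point], gt_classes: Sequence[int]
) -> List[int]:
    """Component (b), the role of Phi: each predicted point takes the class
    of its nearest ground-truth point."""
    return [gt_classes[nearest_index(p, gt)] for p in pred]


if __name__ == "__main__":
    # Toy scene: three predicted points, two ground-truth occupied voxels.
    pred_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 0.0)]
    gt_points = [(0.0, 0.0, 1.0), (5.0, 5.0, 1.0)]
    gt_labels = [3, 7]  # hypothetical class ids

    print("Chamfer distance:", chamfer_distance(pred_points, gt_points))
    print("Target classes:", assign_classes(pred_points, gt_points, gt_labels))
```

In this toy scene the first two predicted points both lie closest to the first ground-truth voxel, so they share its class, while the third takes the class of the second voxel. The resulting target classes would then supervise the predicted classes $\mathbb{C}$ through a classification loss. At realistic scene sizes the brute-force search would likely be replaced by a spatial index or a batched GPU nearest-neighbor routine.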