Occupancy as Set of Points

Yiang Shi, Tianheng Cheng, Qian Zhang, Wenyu Liu, Xinggang Wang

TL;DR

This work introduces Occupancy as Set of Points (OSP), a point-based framework for 3D occupancy prediction from multi-view images that foregrounds Points of Interest (PoIs) to enable flexible, area-focused inference beyond traditional dense volume representations. OSP uses a Transformer-based pipeline with a 3D Position Encoder and a decoder that employs Point Cross-Attention and Group Point Cross-Attention to fuse 2D image features with sparse 3D queries, plus adaptive oversampling to capture local context. The approach is validated on the Occ3D-nuScenes benchmark, achieving strong performance (e.g., 39.41 mIoU) and demonstrating clear advantages over volume-based baselines, as well as providing a plug-in capability to enhance BEVFormer. The key contributions include the PoI-based occupancy representation, the three PoI types (Standard Grids, Adaptively Sampling, Manually Sampling), and the demonstrated flexibility to sample any area, including regions beyond the perception range, while maintaining competitive accuracy and efficiency.
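The idea of representing a scene as a set of Points of Interest can be illustrated with a small sketch. It builds a standard grid of voxel centers and adds a manually sampled, finer region that extends past the grid's boundary. The helper `grid_centers`, the range values, and the voxel sizes are hypothetical and chosen only for illustration; they are not taken from the OSP code, and adaptive sampling is not shown.

```python
from itertools import product


def grid_centers(lo, hi, voxel):
    """Return the centers of the voxels tiling the box [lo, hi) with edge `voxel`."""
    counts = [round((h - l) / voxel) for l, h in zip(lo, hi)]
    axes = [[l + (i + 0.5) * voxel for i in range(n)] for l, n in zip(lo, counts)]
    return list(product(*axes))


# Assumed perception range in metres (illustrative values, not the benchmark's).
PERCEPTION_LO = (-4.0, -4.0, -1.0)
PERCEPTION_HI = (4.0, 4.0, 1.0)

# Standard grid: one point per voxel inside the perception range.
standard_pois = grid_centers(PERCEPTION_LO, PERCEPTION_HI, voxel=1.0)

# Manually sampled region: finer resolution, partly beyond the range in x.
manual_pois = grid_centers((3.0, -1.0, -1.0), (6.0, 1.0, 1.0), voxel=0.5)

# The point set that a point-based decoder would be queried with.
pois = standard_pois + manual_pois
beyond_range = [p for p in manual_pois if p[0] > PERCEPTION_HI[0]]
```

Because the queries are just a list of 3D coordinates, changing the area of interest amounts to changing this list rather than the shape of a dense volume.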

Abstract

In this paper, we explore a novel point representation for 3D occupancy prediction from multi-view images, named Occupancy as Set of Points. Existing camera-based methods tend to use a dense volume-based representation to predict the occupancy of the whole scene, which makes it hard to focus on specific areas or on areas outside the perception range. In contrast, we present Points of Interest (PoIs) to represent the scene and propose OSP, a novel framework for point-based 3D occupancy prediction. Owing to the inherent flexibility of the point-based representation, OSP achieves strong performance compared with existing methods and excels in training and inference adaptability. It extends beyond traditional perception boundaries and can be integrated with volume-based methods to enhance their effectiveness. Experiments on the Occ3D-nuScenes occupancy benchmark show that OSP has strong performance and flexibility. Code and models are available at https://github.com/hustvl/osp.

Paper Structure (33 sections, 8 equations, 4 figures, 8 tables)


Figures (4)

  • Figure 1: Comparison between volume-based methods and our method. Volume-based methods, represented by BEVFormer, infer every region within the scene and produce standard occupancy, as shown in (a). Our method uses a point-based decoder, as shown in (b), so it infers the Points of Interest, including standard, adaptively sampled, and manually sampled grids, as shown in (c).
  • Figure 2: Overall framework of Occupancy as Set of Points. OSP leverages the Transformer architecture to derive 3D point features from 2D images to make 3D occupancy predictions. Initially, we extract 2D features from multi-view images. Following this, we employ a set of 3D point queries to index these 2D features. The selection of these 3D point queries depends on the Points of Interest (PoIs).
  • Figure 3: Pipeline of refining volume-based methods with OSP. Given RGB images, 2D features are extracted by the frozen image backbone of the volume-based method, which in our case is the BEVFormer baseline. We use the volume-based decoder to infer the entire scene and the point decoder to infer our selected 3D points, then combine the two results with a weighted sum.
  • Figure 4: Visualization of our results. Compared with the ground truth, our visualization results contain voids around the ego-vehicle position, because the surround-view images have gaps near the ego-vehicle.
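The weighted-sum combination mentioned in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the functions `fuse_scores` and `predict_labels`, the dictionary layout, and the default `weight=0.5` are all hypothetical. Per-class scores from the volume-based decoder cover every voxel, while scores from the point decoder exist only for the selected points, so only those voxels are blended.

```python
def fuse_scores(volume_scores, point_scores, weight=0.5):
    """Blend per-class scores at voxels where a point prediction exists.

    volume_scores: dict mapping voxel index -> list of class scores (whole scene).
    point_scores:  dict mapping voxel index -> list of class scores (selected points).
    weight:        assumed share given to the point decoder (hypothetical value).
    """
    fused = {idx: list(scores) for idx, scores in volume_scores.items()}
    for idx, p_scores in point_scores.items():
        v_scores = volume_scores[idx]
        fused[idx] = [
            (1.0 - weight) * v + weight * p for v, p in zip(v_scores, p_scores)
        ]
    return fused


def predict_labels(scores):
    """Pick the highest-scoring class index for every voxel."""
    return {idx: max(range(len(s)), key=s.__getitem__) for idx, s in scores.items()}


# Toy example with two voxels and two classes.
volume = {(0, 0, 0): [0.6, 0.4], (1, 0, 0): [0.7, 0.3]}
points = {(1, 0, 0): [0.1, 0.9]}  # the point decoder refines one voxel only

fused = fuse_scores(volume, points, weight=0.5)
labels = predict_labels(fused)
```

In this toy case the voxel without a point prediction keeps its volume-based scores, while the refined voxel's label can change when the point decoder disagrees strongly enough.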