OccNeRF: Advancing 3D Occupancy Prediction in LiDAR-Free Environments

Chubin Zhang, Juncheng Yan, Yi Wei, Jiaxin Li, Li Liu, Yansong Tang, Yueqi Duan, Jiwen Lu

TL;DR

OccNeRF tackles LiDAR-free 3D occupancy prediction by learning occupancy fields from multi-camera imagery without 3D supervision. It introduces a parameterized, unbounded occupancy representation and uses neural rendering with multi-frame photometric supervision, plus open-vocabulary semantic prompts to supervise semantics. The method achieves strong results in self-supervised depth estimation and competitive occupancy prediction on nuScenes and SemanticKITTI, demonstrating data efficiency and scalability. The work advances vision-centric 3D scene understanding for autonomous driving by removing dependence on LiDAR data and enabling flexible semantic labeling through open-vocabulary models.

Abstract

Occupancy prediction reconstructs the 3D structure of the surrounding environment, providing detailed information for autonomous driving planning and navigation. However, most existing methods rely heavily on LiDAR point clouds to generate occupancy ground truth, which are not available in vision-based systems. In this paper, we propose OccNeRF, a method for training occupancy networks without 3D supervision. Unlike previous works, which consider a bounded scene, we parameterize the reconstructed occupancy fields and reorganize the sampling strategy to align with the cameras' infinite perceptive range. Neural rendering is adopted to convert occupancy fields to multi-camera depth maps, supervised by multi-frame photometric consistency. Moreover, for semantic occupancy prediction, we design several strategies to polish the prompts and filter the outputs of a pretrained open-vocabulary 2D segmentation model. Extensive experiments on both self-supervised depth estimation and 3D occupancy prediction tasks on the nuScenes and SemanticKITTI datasets demonstrate the effectiveness of our method.
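
To make the "occupancy fields to depth maps" step concrete, the following is a minimal sketch of how a depth value can be rendered along a single camera ray. It assumes the standard NeRF-style quadrature (per-sample opacity, accumulated transmittance, weighted sum of sample depths); the abstract does not spell out the exact formulation, so the function name `render_depth` and its arguments are hypothetical and chosen only for illustration.

```python
import math


def render_depth(sigmas, depths, deltas):
    """Render an expected depth along one ray (illustrative sketch only).

    sigmas: non-negative density at each sample along the ray
    depths: distance of each sample from the camera
    deltas: length of the ray segment that each sample represents

    Assumes the standard volume rendering quadrature; this is not
    necessarily the exact formulation used by OccNeRF.
    Returns (expected_depth, accumulated_weight).
    """
    transmittance = 1.0   # probability that the ray reaches the current sample
    expected_depth = 0.0
    weight_sum = 0.0
    for sigma, t, delta in zip(sigmas, depths, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha          # contribution of this sample
        expected_depth += weight * t
        weight_sum += weight
        transmittance *= 1.0 - alpha            # light remaining after it
    return expected_depth, weight_sum


# A ray that passes through free space and hits a nearly opaque sample at t = 2.
depth, weight = render_depth([0.0, 1e6, 0.0], [1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

Because every operation is differentiable with respect to the densities, a photometric loss computed on images warped with the rendered depth can, in principle, propagate gradients back to the occupancy field.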

Paper Structure

This paper contains 19 sections, 11 equations, 11 figures, 10 tables.

Figures (11)

  • Figure 1: The overview of OccNeRF. To represent unbounded scenes, we propose a parameterized coordinate to contract infinite space to bounded occupancy fields. Without using any LiDAR data or annotated labels, we leverage temporal photometric constraints and pretrained open-vocabulary segmentation models to provide geometric and semantic supervision.
  • Figure 2: (a) During inference, a 2D backbone first extracts features from multiple cameras, which are then projected and interpolated into 3D space to create volume features. These are used to reconstruct parameterized occupancy fields that capture the extent of unbounded scenes. (b) During training, to generate rendered depth and semantic maps, we employ volume rendering with a redesigned sampling strategy. The depths from multiple frames are refined through the photometric loss. For semantic prediction, we utilize a pretrained Grounded-SAM model enhanced with prompt cleaning. The green arrow denotes the supervision signal.
  • Figure 3: Comparison between original space and parameterized space. The original space utilizes the conventional Euclidean space, emphasizing linear mapping. The parameterized space is divided into two parts: an inner space with linear mapping to preserve high-resolution details and an outer space where point distribution is scaled inversely with distance. This design facilitates the representation of an infinite range within a finite spatial domain.
  • Figure 4: Label generation. The detection bounding boxes generated by Grounding DINO and the semantic labels predicted by SAM in our method are precise, with quality comparable to that of labels obtained by projecting LiDAR points.
  • Figure 5: Qualitative results on the nuScenes dataset. Our method can predict visually appealing depth maps with texture details and fine-grained occupancy. Better viewed when zoomed in.
  • ...and 6 more figures
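
The parameterized space described in the Figure 3 caption can be illustrated with a small per-axis contraction function. This is a sketch under stated assumptions, not the paper's exact formula: it maps the inner region linearly onto a fraction `alpha` of the output range and uses an inverse-distance term for the outer region, with the constants `a` and `b` chosen here so that the value and slope match at the boundary. The names `contract`, `r_inner`, and `alpha`, and their default values, are hypothetical.

```python
import math


def contract(r, r_inner=40.0, alpha=0.67):
    """Map a signed coordinate r in (-inf, inf) into (-1, 1).

    Illustrative sketch only; the constants are derived from continuity
    constraints, not taken from the paper.
      |r| <= r_inner : linear mapping onto [-alpha, alpha]
      |r| >  r_inner : 1 - a / (|r| / r_inner + b), approaching 1 at infinity
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    rp = abs(r) / r_inner                 # distance normalized by the inner range
    if rp <= 1.0:
        out = alpha * rp                  # inner space: linear, keeps resolution
    else:
        # a and b make the value (alpha) and the slope (alpha) continuous at rp = 1.
        a = (1.0 - alpha) ** 2 / alpha
        b = (1.0 - 2.0 * alpha) / alpha
        out = 1.0 - a / (rp + b)          # outer space: inverse-distance scaling
    return math.copysign(out, r)


# With alpha = 0.5 the inner boundary maps to 0.5 and twice that distance to 0.75.
print(contract(40.0, r_inner=40.0, alpha=0.5), contract(80.0, r_inner=40.0, alpha=0.5))
```

Under this choice, a regular voxel grid over (-1, 1) spends a fraction `alpha` of its cells on the nearby region at uniform resolution, while the remaining cells cover everything out to infinity at progressively coarser resolution.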