EasyOcc: 3D Pseudo-Label Supervision for Fully Self-Supervised Semantic Occupancy Prediction Models

Seamie Hayes, Ganesh Sistu, Ciarán Eising

TL;DR

This work tackles semantic occupancy prediction in self-supervised settings by introducing 3D pseudo-labels generated from Grounded-SAM and Metric3Dv2, augmented with temporal densification to stabilize labels. The authors show how these labels can be integrated as a pseudo-loss into existing models and introduce EasyOcc, a streamlined voxel-based model that learns exclusively from the 3D pseudo-labels. Across both camera-masked and full-scene evaluations on Occ3D, 3D pseudo-label supervision yields substantial improvements in IoU and mIoU, reduces reliance on costly view synthesis, and enables better handling of occluded regions. The study also provides extensive ablations, revealing the importance of temporal aggregation, loss balancing, and Lovász loss design for optimal performance.

Abstract

Self-supervised models have recently achieved notable advancements, particularly in the domain of semantic occupancy prediction. These models utilize sophisticated loss computation strategies to compensate for the absence of ground-truth labels. For instance, techniques such as novel view synthesis, cross-view rendering, and depth estimation have been explored to address the issue of semantic and depth ambiguity. However, such techniques typically incur high computational costs and memory usage during the training stage, especially in the case of novel view synthesis. To mitigate these issues, we propose 3D pseudo-ground-truth labels generated by the foundation models Grounded-SAM and Metric3Dv2, and harness temporal information for label densification. Our 3D pseudo-labels can be easily integrated into existing models, which yields substantial performance improvements, with mIoU increasing by 45%, from 9.73 to 14.09, when integrated into the OccNeRF model. This stands in contrast to earlier advancements in the field, which are often not readily transferable to other architectures. Additionally, we propose a streamlined model, EasyOcc, achieving 13.86 mIoU. This model learns solely from our labels, avoiding the complex rendering strategies mentioned above. Furthermore, our method enables models to attain state-of-the-art performance when evaluated on the full scene without applying the camera mask, with EasyOcc achieving 7.71 mIoU, outperforming the previous best model by 31%. These findings highlight the critical importance of foundation models, temporal context, and the choice of loss computation space in self-supervised learning for comprehensive scene understanding.
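
The abstract reports mIoU both with and without the camera mask. The sketch below illustrates how such a metric could be computed on voxel grids; it is not the authors' evaluation code. The function names, the exclusion of the free class from the mean, and the skipping of classes absent from both grids are assumptions made for this example. The second helper only reproduces the arithmetic behind the quoted relative gain.

```python
import numpy as np

def occupancy_miou(pred, target, num_classes, free_label, mask=None):
    """Mean IoU over semantic classes on voxel grids (hypothetical helper).

    mask: optional boolean grid (e.g. a camera-visibility mask); when given,
    only voxels where mask is True are scored. Passing None scores the full scene.
    """
    if mask is not None:
        pred, target = pred[mask], target[mask]
    ious = []
    for c in range(num_classes):
        if c == free_label:
            continue  # assumption: the free/empty class is left out of the mean
        inter = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        if union > 0:  # assumption: classes absent from both grids are skipped
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else float("nan")

def relative_gain_percent(before, after):
    """Relative improvement in percent, e.g. for comparing two mIoU scores."""
    return 100.0 * (after - before) / before

# Arithmetic behind the quoted OccNeRF improvement: (14.09 - 9.73) / 9.73
gain = relative_gain_percent(9.73, 14.09)  # about 44.8, reported as 45%
```

Scoring the same prediction with and without `mask` shows why the two settings differ: the masked score ignores voxels the cameras never observe, while the full-scene score also penalizes errors in occluded regions.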

Paper Structure

This paper contains 39 sections, 11 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Overview of the EasyOcc Model Architecture: Image features are first extracted from the six surrounding camera views using a 2D image encoder. These features are then projected into a 3D feature volume via bilinear sampling. The resulting voxel grid is processed by a 3D CNN to generate the final semantic occupancy prediction. Loss is computed by comparing the predicted output (top) with the corresponding 3D pseudo-labels (bottom).
  • Figure 2: Our Method of Generating 3D Pseudo-Labels: Grounded-SAM labels are first projected into 3D space using Metric3Dv2 depth maps and camera pose information to produce a semantic point cloud. To densify the point cloud, we aggregate 13 temporal samples while filtering out dynamic objects to avoid duplication. The resulting densified point cloud is then passed to a voxelization module to generate 3D pseudo-voxel labels.
  • Figure 3: Temporal Sample Aggregation Analysis: 3D pseudo-labels directly compared to ground-truth labels for various numbers of aggregated samples. A camera mask is applied to align with the model evaluation pipeline.
  • Figure 4: Occupancy Threshold Analysis: 3D pseudo-labels compared to ground-truth labels for various threshold values in the generation process. A camera mask is applied.
  • Figure 5: Occ3D Ground Truth (top) and Our 3D Pseudo-Labels (bottom): Visual comparison of four samples from the Occ3D dataset with their corresponding 3D pseudo-labels generated by our method.
  • ...and 4 more figures
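
The label-generation steps described in the Figure 2 caption (lift 2D semantic labels to 3D with depth and pose, aggregate temporal samples while filtering dynamic objects, then voxelize) can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: all function and parameter names are hypothetical, a pinhole intrinsics matrix `K` and a 4x4 camera-to-world pose are assumed, dynamic-class points are kept only from the key frame, voxels take the majority label, and `min_points` is a stand-in for whatever occupancy threshold the paper uses.

```python
import numpy as np

def unproject(depth, labels, K, cam_to_world):
    """Lift per-pixel labels to a labelled 3D point cloud (pinhole model assumed)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T            # camera-frame rays with z = 1
    pts_cam = rays * depth.reshape(-1, 1)      # scale each ray by its metric depth
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    pts_world = (pts_h @ cam_to_world.T)[:, :3]
    return pts_world, labels.reshape(-1)

def aggregate(frames, key_index, dynamic_classes):
    """Concatenate point clouds from several frames.

    Assumption: points of dynamic classes are dropped from every frame except
    the key frame, so moving objects are not duplicated along their path.
    """
    pts, lbs = [], []
    for i, (p, l) in enumerate(frames):
        if i != key_index:
            keep = ~np.isin(l, list(dynamic_classes))
            p, l = p[keep], l[keep]
        pts.append(p)
        lbs.append(l)
    return np.concatenate(pts), np.concatenate(lbs)

def voxelize(points, labels, origin, voxel_size, grid_shape,
             num_classes, free_label, min_points=1):
    """Majority-vote label per voxel; voxels with fewer than min_points stay free."""
    idx = np.floor((points - origin) / voxel_size).astype(np.int64)
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx, labels = idx[inside], labels[inside]
    flat = np.ravel_multi_index(tuple(idx.T), grid_shape)
    votes = np.zeros((int(np.prod(grid_shape)), num_classes), dtype=np.int64)
    np.add.at(votes, (flat, labels), 1)        # count label votes per voxel
    grid = np.full(votes.shape[0], free_label, dtype=np.int64)
    occupied = votes.sum(axis=1) >= min_points
    grid[occupied] = votes[occupied].argmax(axis=1)
    return grid.reshape(grid_shape)
```

In use, each temporal sample would be passed through `unproject`, the resulting clouds combined with `aggregate`, and the combined cloud handed to `voxelize`. Raising `min_points` trades label density for robustness to stray depth errors, which is the kind of trade-off an occupancy-threshold analysis would examine.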