ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation

Simon Boeder, Fabian Gigengack, Simon Roesler, Holger Caesar, Benjamin Risse

TL;DR

ShelfOcc introduces native 3D supervision for vision-based occupancy estimation by generating metrically consistent 3D voxel labels from multi-view images without LiDAR. It fuses 3D geometry priors (MapAnything) with 2D semantic masks (GroundedSAM) and applies static/dynamic separation, confidence filtering, and visibility masking to create high-quality 3D pseudo-labels. These labels serve as plug-and-play supervision for any 3D occupancy network, yielding substantial gains over prior shelf-supervised methods on Occ3D-nuScenes and narrowing the gap to LiDAR-based supervision. The approach demonstrates a data-centric path to robust 3D understanding in driving scenes and motivates future work on 4D scene reasoning and long-tail semantic categories.

Abstract

Recent progress in self- and weakly supervised occupancy estimation has largely relied on 2D projection or rendering-based supervision, which suffers from geometric inconsistencies and severe depth bleeding. We introduce ShelfOcc, a vision-only method that overcomes these limitations without relying on LiDAR. ShelfOcc brings supervision into native 3D space by generating metrically consistent semantic voxel labels from video, enabling true 3D supervision without any additional sensors or manual 3D annotations. While recent vision-based 3D geometry foundation models provide a promising source of prior knowledge, their outputs cannot be used directly as occupancy predictions because the geometry is sparse, noisy, and inconsistent, especially in dynamic driving scenes. Our method introduces a dedicated framework that mitigates these issues by filtering and accumulating static geometry consistently across frames, handling dynamic content, and propagating semantic information into a stable voxel representation. This data-centric shift in supervision for weakly/shelf-supervised occupancy estimation allows the use of essentially any SOTA occupancy model architecture without relying on LiDAR data. We argue that such high-quality supervision is essential for robust occupancy learning and constitutes an important complementary avenue to architectural innovation. On the Occ3D-nuScenes benchmark, ShelfOcc substantially outperforms all previous weakly/shelf-supervised methods (up to a 34% relative improvement), establishing a new data-driven direction for LiDAR-free 3D scene understanding.
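
To make the idea of propagating semantic information into a voxel representation concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each 3D point from the geometry model already carries a semantic label and a confidence score, drops low-confidence points, and assigns each voxel the majority label of the points inside it. The function name, the confidence threshold, and the grid constants (0.4 m voxels on a 200 x 200 x 16 grid) are illustrative assumptions, not values taken from the text above.

```python
import math
from collections import Counter, defaultdict

# Assumed grid layout (hypothetical constants for this sketch):
# 0.4 m voxels covering [-40, 40] x [-40, 40] x [-1, 5.4] metres.
VOXEL_SIZE = 0.4
GRID_ORIGIN = (-40.0, -40.0, -1.0)
GRID_SHAPE = (200, 200, 16)


def voxelize_semantic_points(points, labels, confidences, min_confidence=0.5):
    """Turn labeled 3D points into a sparse voxel map {(i, j, k): label}.

    points:      iterable of (x, y, z) tuples in metres
    labels:      one semantic label per point
    confidences: one confidence score per point (assumed to lie in [0, 1])
    """
    votes = defaultdict(Counter)
    for point, label, confidence in zip(points, labels, confidences):
        # Confidence filtering: skip points the geometry model is unsure about.
        if confidence < min_confidence:
            continue
        # Convert metric coordinates to integer voxel indices.
        index = tuple(
            math.floor((coord - origin) / VOXEL_SIZE)
            for coord, origin in zip(point, GRID_ORIGIN)
        )
        # Ignore points that fall outside the voxel grid.
        if all(0 <= i < n for i, n in zip(index, GRID_SHAPE)):
            votes[index][label] += 1
    # Majority vote per voxel; ties go to the label seen first.
    return {index: counts.most_common(1)[0][0] for index, counts in votes.items()}
```

Voxels that receive no points are simply absent from the returned dictionary. Deciding whether such voxels are free space or unobserved is the job of visibility masking, which this sketch does not attempt.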

Paper Structure

This paper contains 32 sections, 2 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Contributions of ShelfOcc. We propose a shift in supervision strategy for weakly/shelf-supervised occupancy estimation. Unlike prior 2D rendering-based approaches, which are prone to depth bleeding, ShelfOcc trains occupancy networks directly in native 3D voxel space with pseudo-labels generated using a combination of geometric and semantic foundation models (FMs). By accumulating and filtering static geometry while handling dynamic objects separately, our approach yields clean and consistent 3D supervision relying only on images, without LiDAR. This shift in supervision leads to a significant performance gain over previous methods, as illustrated on the right.
  • Figure 2: Overview of the ShelfOcc framework. We leverage a 3D geometry foundation model (MapAnything; Keetha et al., 2025) and a 2D semantic foundation model (GroundedSAM; Ren et al., 2024) to construct precise 3D semantic voxel pseudo-labels. The pipeline processes image sequences by separating static and dynamic scene elements, filtering and aggregating the static elements, and carefully reintroducing dynamic objects to mitigate artifacts. The generated 3D pseudo-labels serve as plug-and-play supervision for any 3D occupancy network.
  • Figure 3: Qualitative results on the Occ3D-nuScenes dataset. We show the images, the ground-truth occupancy, the ShelfOcc pseudo-labels, and the predictions of ShelfOcc + STCOcc (Liao et al., 2025). Best viewed when zoomed in.
  • Figure A.1: Qualitative comparison with the previous state of the art. We show predictions from STCOcc (Liao et al., 2025) trained on our ShelfOcc pseudo-labels, compared against GaussianFlowOcc (Boeder et al., ICCV 2025) and the Occ3D-nuScenes ground truth. STCOcc produces cleaner and more geometrically consistent occupancy predictions, demonstrating the benefits of our 3D supervision.
  • Figure A.2: Qualitative comparison of the different versions of our proposed pipeline. We visualize pseudo-labels produced by the three pipeline variants introduced in the main paper: (1) the naïve single-frame approach, (2) full temporal aggregation without handling motion, and (3) our final design, which aggregates static geometry while treating dynamic objects separately. The comparison highlights how version 3 avoids sparsity, object trails, and missing objects, resulting in clean and coherent 3D supervision.
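
The difference between the pipeline variants in Figure A.2 can be illustrated with a small sketch. This is a simplified illustration under stated assumptions, not the authors' pipeline: points are assumed to be expressed in a shared world frame already, each point carries a precomputed `is_dynamic` flag, and the `min_observations` consistency filter is a hypothetical stand-in for the filtering step. Static geometry is accumulated over all frames, while dynamic points are taken only from the target frame, which is what prevents moving objects from leaving trails.

```python
import math
from collections import Counter


def voxel_key(point, voxel_size=0.4):
    """Integer voxel index of an (x, y, z) point; the voxel size is an assumption."""
    return tuple(math.floor(coord / voxel_size) for coord in point)


def aggregate_pseudo_labels(frames, target_index, voxel_size=0.4, min_observations=2):
    """Build occupied-voxel sets for one target frame.

    frames: list of frames, each a list of ((x, y, z), is_dynamic) pairs
            in a common world frame (assumed already aligned).
    Returns (static_voxels, dynamic_voxels) as sets of voxel indices.
    """
    # Count in how many frames each static voxel was observed.
    static_hits = Counter()
    for frame in frames:
        seen = {voxel_key(p, voxel_size) for p, is_dynamic in frame if not is_dynamic}
        static_hits.update(seen)

    # Keep only static voxels that were observed consistently across frames.
    static_voxels = {v for v, n in static_hits.items() if n >= min_observations}

    # Dynamic content comes from the target frame only, so moving objects
    # appear once at their current position instead of as a trail.
    dynamic_voxels = {
        voxel_key(p, voxel_size)
        for p, is_dynamic in frames[target_index]
        if is_dynamic
    }
    return static_voxels, dynamic_voxels
```

Dropping the `is_dynamic` check in the accumulation loop would correspond to variant (2), where a moving object is written into every voxel it passes through. Restricting the loop to the target frame alone would correspond to the sparse single-frame variant (1).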