ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation

Simon Boeder; Fabian Gigengack; Simon Roesler; Holger Caesar; Benjamin Risse

TL;DR

ShelfOcc introduces native 3D supervision for vision-based occupancy estimation by generating metrically consistent 3D voxel labels from multi-view images without LiDAR. It fuses 3D geometry priors (MapAnything) with 2D semantic masks (GroundedSAM) and employs static/dynamic separation, confidence filtering, and visibility masking to create high-quality 3D pseudo-labels. These labels serve as plug-and-play supervision for any 3D occupancy network, yielding substantial gains over prior shelf-supervised methods on Occ3D-nuScenes and narrowing the gap to LiDAR-based supervision. The approach demonstrates a data-centric path to robust 3D understanding in driving scenes and motivates future work on 4D scene reasoning and long-tail semantic categories.

Abstract

Recent progress in self- and weakly supervised occupancy estimation has largely relied on 2D projection or rendering-based supervision, which suffers from geometric inconsistencies and severe depth bleeding. We introduce ShelfOcc, a vision-only method that overcomes these limitations without relying on LiDAR. ShelfOcc brings supervision into native 3D space by generating metrically consistent semantic voxel labels from video, enabling true 3D supervision without additional sensors or manual 3D annotations. Recent vision-based 3D geometry foundation models are a promising source of prior knowledge, but their outputs cannot be used directly as occupancy predictions because the geometry is sparse, noisy, or inconsistent, especially in dynamic driving scenes. Our method introduces a dedicated framework that mitigates these issues by filtering and accumulating static geometry consistently across frames, handling dynamic content, and propagating semantic information into a stable voxel representation. This data-centric shift in supervision for weakly/shelf-supervised occupancy estimation allows the use of essentially any SOTA occupancy model architecture without relying on LiDAR data. We argue that such high-quality supervision is essential for robust occupancy learning and constitutes an important avenue complementary to architectural innovation. On the Occ3D-nuScenes benchmark, ShelfOcc substantially outperforms all previous weakly/shelf-supervised methods (up to a 34% relative improvement), establishing a new data-driven direction for LiDAR-free 3D scene understanding.
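To make the idea of native 3D pseudo-labels concrete, the following is a minimal sketch of one step only: turning semantically labeled 3D points into voxel labels after a confidence filter. It is not the authors' implementation. The function name `voxelize_points`, the point tuple layout, the `min_conf` threshold, and the majority-vote rule are all assumptions made for illustration, and the sketch omits static/dynamic separation, multi-frame accumulation, and visibility masking.

```python
from collections import Counter, defaultdict
import math


def voxelize_points(points, voxel_size, origin, grid_shape, min_conf=0.5):
    """Assign a semantic class to each voxel by majority vote (illustrative only).

    points: iterable of (x, y, z, confidence, class_id) in one metric frame.
    Returns a dict mapping (i, j, k) voxel indices to the winning class_id;
    voxels that receive no surviving points are simply absent from the dict.
    """
    votes = defaultdict(Counter)
    for x, y, z, conf, cls in points:
        # Hypothetical confidence filter: drop low-confidence geometry.
        if conf < min_conf:
            continue
        idx = tuple(
            math.floor((c - o) / voxel_size) for c, o in zip((x, y, z), origin)
        )
        # Keep only points that fall inside the voxel grid.
        if all(0 <= i < n for i, n in zip(idx, grid_shape)):
            votes[idx][cls] += 1
    return {idx: counter.most_common(1)[0][0] for idx, counter in votes.items()}


# Toy input: class ids and coordinates are made up for the example.
points = [
    (0.1, 0.1, 0.1, 0.90, 3),
    (0.2, 0.3, 0.1, 0.80, 3),
    (0.3, 0.2, 0.2, 0.70, 5),
    (0.1, 0.2, 0.3, 0.20, 5),  # below min_conf, filtered out
    (0.9, 0.1, 0.1, 0.95, 7),
    (5.0, 0.0, 0.0, 0.99, 7),  # outside the grid, ignored
]
labels = voxelize_points(
    points, voxel_size=0.4, origin=(0.0, 0.0, 0.0), grid_shape=(4, 4, 4)
)
```

In this toy run, the voxel at index `(0, 0, 0)` receives two votes for class 3 and one for class 5, so it is labeled 3, while the voxel at `(2, 0, 0)` is labeled 7. A sparse dict is used here only to keep the sketch dependency-free; a dense array would be the more usual training target.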
