FreeOcc: Training-free Panoptic Occupancy Prediction via Foundation Models

Andrew Caunes, Thierry Chateau, Vincent Fremont

TL;DR

FreeOcc is a training-free pipeline that leverages pretrained foundation models to recover both semantics and geometry from multi-view images. Its results highlight foundation-model-driven perception as a practical route to training-free 3D scene understanding.

Abstract

Semantic and panoptic occupancy prediction for road scene analysis provides a dense 3D representation of the ego vehicle's surroundings. Current camera-only approaches typically rely on costly dense 3D supervision or require training models on data from the target domain, limiting deployment in unseen environments. We propose FreeOcc, a training-free pipeline that leverages pretrained foundation models to recover both semantics and geometry from multi-view images. FreeOcc extracts per-view panoptic priors with a promptable foundation segmentation model and prompt-to-taxonomy rules, and reconstructs metric 3D points with a reconstruction foundation model. Depth- and confidence-aware filtering lifts reliable labels into 3D; the labeled points are then fused over time and voxelized with a deterministic refinement stack. For panoptic occupancy, instances are recovered by fitting and merging robust current-view 3D box candidates, enabling instance-aware occupancy without any learned 3D model. On Occ3D-nuScenes, FreeOcc achieves 16.9 mIoU and 16.5 RayIoU without any training, on par with state-of-the-art weakly supervised methods. When employed as a pseudo-label generation pipeline for training downstream models, it achieves 21.1 RayIoU, surpassing the previous state-of-the-art weakly supervised baseline. Furthermore, FreeOcc sets new baselines for both training-free and weakly supervised panoptic occupancy prediction, achieving 3.1 RayPQ and 3.9 RayPQ, respectively. These results highlight foundation-model-driven perception as a practical route to training-free 3D scene understanding.
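
To make the depth- and confidence-aware lifting step more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `lift_reliable_pixels`, the pinhole back-projection from intrinsics (`fx`, `fy`, `cx`, `cy`), and the thresholds `max_depth` and `min_conf` are all assumptions chosen for illustration. The abstract only states that labels are lifted after depth and confidence filtering.

```python
from typing import List, Optional, Tuple

# (x, y, z, class_id) in the camera frame
Point = Tuple[float, float, float, int]


def lift_reliable_pixels(
    depth: List[List[float]],
    conf: List[List[float]],
    labels: List[List[Optional[int]]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    max_depth: float = 40.0,  # hypothetical depth cutoff, in metres
    min_conf: float = 0.5,    # hypothetical confidence cutoff
) -> List[Point]:
    """Keep labeled pixels with plausible depth and high confidence,
    then back-project them with a pinhole camera model."""
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            label = labels[v][u]
            if label is None:
                continue  # no semantic prior for this pixel
            if not (0.0 < z <= max_depth):
                continue  # invalid or too-distant depth
            if conf[v][u] < min_conf:
                continue  # unreliable reconstruction
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z, label))
    return points


# Toy 2x2 image: only the top-left pixel passes every filter.
depth = [[10.0, 100.0], [5.0, 2.0]]
conf = [[0.9, 0.9], [0.1, 0.8]]
labels = [[1, 2], [3, None]]
kept = lift_reliable_pixels(depth, conf, labels, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

In the toy example, one pixel is rejected for excessive depth, one for low confidence, and one for lacking a label, which yields the sparse labeled point cloud that later stages fuse and voxelize.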

Paper Structure

This paper contains 14 sections, 6 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: FreeOcc: Training-free panoptic occupancy prediction from foundation models. The pipeline operates directly from camera images and poses without any training. It combines a Semantic Branch powered by the SegmentAnything foundation model (SAM3) and a Geometric Branch powered by the MapAnything 3D reconstruction model to obtain 3D semantically labeled point clouds with instance priors from SAM3's masks. An Instance Identification module then refines instance candidates and re-assigns wrongly labeled points to the correct instance and semantic class. Finally, the fused point cloud is voxelized and refined to produce the final panoptic occupancy grid.
  • Figure 2: Semantic branch. A set of simple handcrafted prompts is fed to SegmentAnything3 (SAM3) [kirillov_segment_2023], which yields 2D mask candidates with scores. We fuse them into per-view semantic and instance priors and use rules to remap prompt classes to the target taxonomy. Several prompts, e.g. synonyms, can be used for each target class.
  • Figure 3: Geometric branch. In parallel, MapAnything outputs per-pixel 3D points along with depth and confidence maps. We filter points by depth and confidence and keep the remaining labeled 3D points, yielding a sparse point cloud.
  • Figure 4: Instance identification. Prompted SAM3 yields 2D mask candidates with scores; we fuse them into per-view semantic and instance priors and remap prompts to a canonical taxonomy with a simple lookup. MapAnything yields metric depth with a confidence signal; after reliability filtering we lift pixels to labeled 3D points. We optionally fuse points over time (non-causal or causal by frame selection), regularize thing instances via current-sample 3D box candidates, then voxelize and refine to produce semantic and panoptic occupancy grids.
  • Figure 5: Voxelization branch. We voxelize the labeled point cloud and apply a lightweight multi-stage refinement step to produce semantic and panoptic occupancy grids. Note that small holes, ego-vehicle noise and backgrounded ghost artifacts are removed in this stage.
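
As a rough illustration of the voxelization stage described in Figure 5, the sketch below bins labeled 3D points into a grid and assigns each voxel the majority label. It is a sketch under stated assumptions, not the paper's refinement stack: the function name `voxelize_majority`, the default `voxel_size` of 0.4 m, and the `min_points` sparsity filter (used here as a crude stand-in for noise removal) are all hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Voxel = Tuple[int, int, int]
LabeledPoint = Tuple[float, float, float, int]  # (x, y, z, class_id)


def voxelize_majority(
    points: Iterable[LabeledPoint],
    voxel_size: float = 0.4,  # assumed grid resolution, in metres
    min_points: int = 1,      # hypothetical sparsity filter
) -> Dict[Voxel, int]:
    """Map each occupied voxel index to the most frequent label inside it."""
    votes: Dict[Voxel, Counter] = defaultdict(Counter)
    for x, y, z, label in points:
        idx = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        votes[idx][label] += 1

    grid: Dict[Voxel, int] = {}
    for idx, counter in votes.items():
        if sum(counter.values()) < min_points:
            continue  # drop voxels supported by too few points
        grid[idx] = counter.most_common(1)[0][0]
    return grid


# Three points share voxel (0, 0, 0); an isolated point lands in (2, 0, 0).
cloud = [
    (0.1, 0.1, 0.1, 1),
    (0.2, 0.3, 0.1, 1),
    (0.3, 0.1, 0.2, 2),
    (1.0, 0.0, 0.0, 3),
]
dense_grid = voxelize_majority(cloud)
filtered_grid = voxelize_majority(cloud, min_points=2)
```

A sparse dictionary keeps the example short; a real pipeline would more likely write into a fixed-size array covering the evaluation volume and then apply the hole-filling and artifact-removal steps the caption mentions.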