Test-Time 3D Occupancy Prediction

Fengyi Zhang, Xiangyu Sun, Huitong Yang, Zheng Zhang, Zi Huang, Yadan Luo

TL;DR

TT-Occ addresses the high cost and rigidity of traditional dense 3D occupancy decoders by introducing a training-free test-time framework that builds time-aware 3D Gaussians from raw sensor data and vision foundation models. It lifts geometric and semantic cues into Gaussian primitives, tracks dynamic elements, and voxelizes predictions at arbitrary resolutions, all without network training. The method supports open-vocabulary semantics and works with either LiDAR or camera streams, achieving competitive results on Occ3D-nuScenes and nuCraft and demonstrating clear temporal coherence. These contributions offer a practical pathway to deploy robust occupancy prediction in real-world driving systems with flexible sensor and semantic integration.

Abstract

Self-supervised 3D occupancy prediction offers a promising solution for understanding complex driving scenes without requiring costly 3D annotations. However, training dense occupancy decoders to capture fine-grained geometry and semantics can demand hundreds of GPU hours, and once trained, such models struggle to adapt to varying voxel resolutions or novel object categories without extensive retraining. To overcome these limitations, we propose a practical and flexible test-time occupancy prediction framework termed TT-Occ. Our method incrementally constructs, optimizes, and voxelizes time-aware 3D Gaussians from raw sensor streams by integrating vision foundation models (VFMs) at runtime. The flexible nature of 3D Gaussians allows voxelization at arbitrary user-specified resolutions, while the generalization ability of VFMs enables accurate perception and open-vocabulary recognition, without any network training or fine-tuning. Specifically, TT-Occ operates in three stages, lift, track, and voxelize. First, we lift the geometry and semantics that VFMs extract from the surrounding views to instantiate Gaussians in 3D space. Next, we track dynamic Gaussians while accumulating static ones to complete the scene and enforce temporal consistency. Finally, we voxelize the optimized Gaussians to generate the occupancy prediction. Optionally, inherent noise in VFM predictions and tracking is mitigated by periodically smoothing neighboring Gaussians during optimization. To validate the generality and effectiveness of our framework, we offer two variants, one LiDAR-based and one vision-centric, and conduct extensive experiments on the Occ3D and nuCraft benchmarks with varying voxel resolutions.
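To make the "voxelize at arbitrary resolution" idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation: it assumes each Gaussian is reduced to its center and a single semantic label, and it assigns each voxel the majority label of the centers that fall inside it. The function name `voxelize_gaussians`, its parameters, and the use of a `free_label` for empty voxels are all hypothetical choices made for illustration. Covariances and opacities, which a real Gaussian voxelizer would likely use, are ignored.

```python
import numpy as np

def voxelize_gaussians(means, labels, voxel_size, range_min, range_max,
                       num_classes, free_label):
    """Majority-vote semantic voxelization of Gaussian centers.

    Simplifying assumption: each Gaussian contributes one vote at its
    center; its covariance and opacity are ignored.
    """
    means = np.asarray(means, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.int64)
    range_min = np.asarray(range_min, dtype=np.float64)
    range_max = np.asarray(range_max, dtype=np.float64)

    # Grid dimensions follow directly from the requested voxel size.
    dims = tuple(int(d) for d in np.round((range_max - range_min) / voxel_size))

    # Integer voxel index of every center; drop centers outside the range.
    idx = np.floor((means - range_min) / voxel_size).astype(np.int64)
    inside = np.all((idx >= 0) & (idx < np.array(dims)), axis=1)
    idx, labels = idx[inside], labels[inside]

    # Count votes per (voxel, class) pair.
    flat = np.ravel_multi_index(idx.T, dims)
    votes = np.zeros((int(np.prod(dims)), num_classes), dtype=np.int64)
    np.add.at(votes, (flat, labels), 1)

    # Voxels with no votes stay free; the rest take the majority class.
    grid = np.full(votes.shape[0], free_label, dtype=np.int64)
    occupied = votes.sum(axis=1) > 0
    grid[occupied] = votes[occupied].argmax(axis=1)
    return grid.reshape(dims)
```

Because the set of Gaussians is independent of the grid, the same inputs can be voxelized again with a different `voxel_size` without recomputing anything upstream, which is the flexibility the abstract describes.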

Paper Structure

This paper contains 19 sections, 7 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: Comparison of self-supervised occupancy prediction methods in terms of pretraining time (x-axis), mIoU (y-axis), and runtime FPS (marker radius). TT-Occ achieves strong mIoU and competitive FPS without any pretraining.
  • Figure 2: Overview of the proposed TT-OccCamera and TT-OccLiDAR approaches. TT-Occ performs test-time 3D occupancy estimation by directly integrating a suite of VFMs at runtime, avoiding any network training or fine-tuning. (a) Multi-view semantics are obtained using arbitrary open-vocabulary segmentation VFMs (OpenSeed, GroundingSAM, REX-Omni). (b) Geometry cues are extracted via depth/correspondence VFMs (VGGT, MapAnything). (c) Dynamic flow is estimated to track moving objects and prevent trailing artifacts while accumulating static structure over time. (d) The resulting features instantiate and refine time-aware 3D Gaussians, which can be voxelized at any resolution for occupancy prediction. (e) TT-Occ supports both LiDAR-based and camera-only variants and enables multi-resolution voxelization and semi-automatic annotation. Despite leveraging multiple VFMs, TT-Occ remains highly efficient, delivering competitive occupancy performance across Occ3D-nuScenes and nuCraft.
  • Figure 3: Illustration of trailing artifacts caused by naïvely accumulating per-frame Gaussians. Without handling dynamic regions, moving objects (e.g., cars shown in blue voxels) leave behind smeared or duplicated structures, corrupting the occupancy field. Our tracking suppresses these artifacts by separating dynamic Gaussians from static ones.
  • Figure 4: Qualitative comparisons on nuCraft between both variants of the proposed TT-Occ and SelfOcc.
  • Figure 6: Inference time comparison among TT-OccCamera, TT-OccLiDAR, and SelfOcc. The horizontal stacked bars illustrate the per-module runtime composition of each method.
  • ...and 9 more figures
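
The trailing artifacts described in the Figure 3 caption can be reproduced with a toy example. The sketch below is an illustration under strong assumptions, not the paper's tracking method: points stand in for Gaussians, a per-point boolean `is_dynamic` mask is assumed to be given, and dynamic points from earlier frames are simply discarded rather than tracked with estimated flow. The function names `accumulate_naive` and `accumulate_separated` are hypothetical.

```python
import numpy as np

def accumulate_naive(frames):
    """Stack every point from every frame; moving objects leave trails."""
    return np.concatenate([points for points, _ in frames], axis=0)

def accumulate_separated(frames):
    """Accumulate static points over time, keep dynamic points only from
    the latest frame.

    Simplification: past dynamic points are dropped instead of being
    moved along an estimated flow, as a real tracker would do.
    """
    static_parts = []
    dynamic_now = np.empty((0, 3))
    for points, is_dynamic in frames:
        points = np.asarray(points, dtype=np.float64)
        is_dynamic = np.asarray(is_dynamic, dtype=bool)
        static_parts.append(points[~is_dynamic])
        dynamic_now = points[is_dynamic]  # overwrite, do not accumulate
    return np.concatenate(static_parts + [dynamic_now], axis=0)

# Two frames: a static wall point and a car that moves 2 m along x.
frame0 = (np.array([[0.0, 5.0, 0.0], [1.0, 0.0, 0.0]]), np.array([False, True]))
frame1 = (np.array([[0.0, 5.5, 0.0], [3.0, 0.0, 0.0]]), np.array([False, True]))

naive = accumulate_naive([frame0, frame1])
separated = accumulate_separated([frame0, frame1])
```

In `naive`, the car appears at both x = 1 and x = 3, which is the duplicated structure the caption describes. In `separated`, both static wall points are kept while the car appears only at its latest position.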