Test-Time 3D Occupancy Prediction
Fengyi Zhang, Xiangyu Sun, Huitong Yang, Zheng Zhang, Zi Huang, Yadan Luo
TL;DR
TT-Occ addresses the high cost and rigidity of traditional dense 3D occupancy decoders by introducing a training-free test-time framework that builds time-aware 3D Gaussians from raw sensor data and vision foundation models. It lifts geometric and semantic cues into Gaussian primitives, tracks dynamic elements, and voxelizes predictions at arbitrary resolutions, all without network training. The method supports open-vocabulary semantics and works with either LiDAR or camera streams, achieving competitive results on Occ3D-nuScenes and nuCraft and demonstrating clear temporal coherence. These contributions offer a practical pathway to deploy robust occupancy prediction in real-world driving systems with flexible sensor and semantic integration.
Abstract
Self-supervised 3D occupancy prediction offers a promising solution for understanding complex driving scenes without requiring costly 3D annotations. However, training dense occupancy decoders to capture fine-grained geometry and semantics can demand hundreds of GPU hours, and once trained, such models struggle to adapt to varying voxel resolutions or novel object categories without extensive retraining. To overcome these limitations, we propose a practical and flexible test-time occupancy prediction framework termed TT-Occ. Our method incrementally constructs, optimizes, and voxelizes time-aware 3D Gaussians from raw sensor streams by integrating vision foundation models (VFMs) at runtime. The flexible nature of 3D Gaussians allows voxelization at arbitrary user-specified resolutions, while the generalization ability of VFMs enables accurate perception and open-vocabulary recognition, without any network training or fine-tuning. Specifically, TT-Occ operates in a lift-track-voxelize symphony: we first lift the geometry and semantics of the surrounding views, as extracted by VFMs, to instantiate Gaussians in 3D space; next, we track dynamic Gaussians while accumulating static ones to complete the scene and enforce temporal consistency; finally, we voxelize the optimized Gaussians to generate the occupancy prediction. Optionally, inherent noise in VFM predictions and tracking is mitigated by periodically smoothing neighboring Gaussians during optimization. To validate the generality and effectiveness of our framework, we offer two variants, one LiDAR-based and one vision-centric, and conduct extensive experiments on the Occ3D and nuCraft benchmarks with varying voxel resolutions.
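To illustrate why a Gaussian scene representation can be voxelized at any user-specified resolution, the following is a minimal sketch of the final voxelize step only. It is not the TT-Occ implementation: it assumes each Gaussian is reduced to its center and a single integer semantic label (ignoring covariance and opacity), and it assigns each voxel the majority label of the centers that fall inside it. The function name `voxelize_gaussians`, its parameters, and the `free_label` convention are hypothetical and chosen for illustration.

```python
import numpy as np

def voxelize_gaussians(means, labels, voxel_size, pc_range, num_classes, free_label=-1):
    """Majority-vote voxelization of labeled Gaussian centers (illustrative sketch).

    means:    (N, 3) array of Gaussian centers in meters (assumed input).
    labels:   (N,) integer semantic labels in [0, num_classes).
    pc_range: [x_min, y_min, z_min, x_max, y_max, z_max] of the grid.
    Returns an integer grid; voxels containing no center get free_label.
    """
    means = np.asarray(means, dtype=float)
    labels = np.asarray(labels, dtype=int)
    lo = np.asarray(pc_range[:3], dtype=float)
    hi = np.asarray(pc_range[3:], dtype=float)

    # Grid shape follows directly from the requested voxel size.
    dims = np.round((hi - lo) / voxel_size).astype(int)

    # Map each center to its voxel index and drop centers outside the grid.
    idx = np.floor((means - lo) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < dims), axis=1)
    idx, labels = idx[inside], labels[inside]

    # Accumulate one vote per center for its label in its voxel.
    votes = np.zeros((*dims, num_classes), dtype=np.int64)
    np.add.at(votes, (idx[:, 0], idx[:, 1], idx[:, 2], labels), 1)

    # Majority label per voxel; voxels with no votes are marked free.
    occ = votes.argmax(axis=-1)
    occ[votes.sum(axis=-1) == 0] = free_label
    return occ
```

Because the Gaussians are stored independently of any grid, the same set can be re-voxelized by calling the function again with a different `voxel_size`; nothing has to be retrained or recomputed upstream. A fuller version would also splat each Gaussian's extent over neighboring voxels rather than using only its center.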
