OccSTeP: Benchmarking 4D Occupancy Spatio-Temporal Persistence
Yu Zheng, Jie Hu, Kailun Yang, Jiaming Zhang
TL;DR
The paper introduces OccSTeP, a 4D occupancy spatio-temporal persistence framework for autonomous driving that covers both reactive and proactive forecasting. It proposes a tokenizer-free voxel world model, OccSTeP-WM, which combines a linear-complexity attention backbone with a Mamba-based recurrent state-space module and incrementally fuses spatio-temporal priors using SE(3) warping, enabling online, robust predictions when history is noisy or missing. A new OccSTeP benchmark with four disturbance regimes evaluates persistence, robustness, and action-conditioned rollout; on it, OccSTeP-WM achieves state-of-the-art semantic mIoU and occupancy IoU. The work also provides extensive ablations, practical implementation details, and a planned open-source release to support robust, online world modeling for planning in dynamic driving environments.
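To make the SE(3) warping step concrete, the following is a minimal sketch of ego-motion compensation on a dense voxel grid. It is not the OccSTeP-WM implementation: the function name `warp_voxels_se3`, its parameters, the convention that the 4x4 transform maps current-frame points into the previous ego frame, and the nearest-neighbour resampling are all assumptions made for illustration.

```python
import numpy as np

def warp_voxels_se3(prev_grid, T_cur_to_prev, voxel_size, origin, fill=0):
    """Resample a previous-frame voxel grid into the current ego frame.

    Hypothetical helper: T_cur_to_prev is assumed to be a 4x4 rigid
    transform taking points in the current ego frame to the previous one.
    """
    origin = np.asarray(origin, dtype=float)
    shape = np.array(prev_grid.shape)
    # Integer index of every voxel of the current grid, in C order, shape (N, 3).
    axes = [np.arange(s) for s in shape]
    idx = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
    # Metric coordinates of the voxel centres in the current ego frame.
    centers = origin + (idx + 0.5) * voxel_size
    # Rigid transform: p_prev = R @ p_cur + t.
    R, t = T_cur_to_prev[:3, :3], T_cur_to_prev[:3, 3]
    prev_pts = centers @ R.T + t
    # Convert back to integer indices in the previous grid (nearest voxel).
    prev_idx = np.floor((prev_pts - origin) / voxel_size).astype(int)
    valid = np.all((prev_idx >= 0) & (prev_idx < shape), axis=1)
    # Voxels that fall outside the previous grid keep the fill label.
    warped = np.full(prev_grid.shape, fill, dtype=prev_grid.dtype)
    flat = warped.reshape(-1)  # view onto the freshly allocated array
    flat[valid] = prev_grid[prev_idx[valid, 0], prev_idx[valid, 1], prev_idx[valid, 2]]
    return warped

# Usage: the ego vehicle advanced by one voxel along x between frames.
grid = np.zeros((4, 4, 4), dtype=np.int64)
grid[2, 1, 1] = 5
T = np.eye(4)
T[0, 3] = 1.0
warped = warp_voxels_se3(grid, T, voxel_size=1.0, origin=[-2.0, -2.0, -2.0])
```

Nearest-neighbour sampling is used here because the grid holds discrete semantic labels; a feature-valued scene memory would more plausibly use trilinear interpolation.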
Abstract
Autonomous driving requires a persistent understanding of 3D scenes that is robust to temporal disturbances and accounts for potential future actions. We introduce a new concept of 4D Occupancy Spatio-Temporal Persistence (OccSTeP), which aims to address two tasks: (1) reactive forecasting: "what will happen next" and (2) proactive forecasting: "what would happen given a specific future action". We create the first OccSTeP benchmark, which includes challenging scenarios (e.g., erroneous semantic labels and dropped frames). To address this task, we propose OccSTeP-WM, a tokenizer-free world model that maintains a dense voxel-based scene state and incrementally fuses spatio-temporal context over time. OccSTeP-WM leverages a linear-complexity attention backbone and a recurrent state-space module to capture long-range spatial dependencies while continually updating the scene memory with ego-motion compensation. This design enables online inference and robust performance even when historical sensor input is missing or noisy. Extensive experiments demonstrate the effectiveness of the OccSTeP concept and our OccSTeP-WM, yielding an average semantic mIoU of 23.70% (+6.56% gain) and occupancy IoU of 35.89% (+9.26% gain). The data and code will be open-sourced at https://github.com/FaterYU/OccSTeP.
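For readers unfamiliar with the two reported metrics, the sketch below shows how occupancy IoU and semantic mIoU are commonly computed on a labeled voxel grid. It is an illustration under stated assumptions, not the benchmark's evaluation code: the function names, the `free_label=0` convention for empty voxels, and the per-sample averaging are hypothetical (benchmarks often accumulate intersections and unions over the whole dataset before dividing).

```python
import numpy as np

def occupancy_iou(pred, gt, free_label=0):
    """Class-agnostic IoU between occupied (non-free) voxels."""
    pred_occ = pred != free_label
    gt_occ = gt != free_label
    inter = np.logical_and(pred_occ, gt_occ).sum()
    union = np.logical_or(pred_occ, gt_occ).sum()
    return float(inter / union) if union > 0 else float("nan")

def semantic_miou(pred, gt, num_classes, free_label=0):
    """Mean of per-class IoU over semantic classes, skipping the free class."""
    ious = []
    for c in range(num_classes):
        if c == free_label:
            continue
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # ignore classes absent from both prediction and ground truth
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else float("nan")

# Toy example: six voxels, labels 0 (free), 1 and 2 (hypothetical classes).
pred = np.array([0, 1, 1, 2, 2, 0])
gt = np.array([0, 1, 2, 2, 2, 1])
iou = occupancy_iou(pred, gt)                   # 4 / 5 = 0.8
miou = semantic_miou(pred, gt, num_classes=3)   # mean(1/3, 2/3) = 0.5
```

Occupancy IoU only asks whether a voxel is occupied, while mIoU also requires the correct semantic class, which is why the semantic score is typically the lower of the two.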
