OccSTeP: Benchmarking 4D Occupancy Spatio-Temporal Persistence

Yu Zheng, Jie Hu, Kailun Yang, Jiaming Zhang

TL;DR

The paper introduces OccSTeP, a 4D occupancy spatio-temporal persistence framework for autonomous driving that covers both reactive and proactive forecasting. It proposes OccSTeP-WM, a tokenizer-free voxel world model built on a linear-time Mamba-based backbone and incremental spatio-temporal prior fusion with SE(3) warping, which enables online, robust prediction when the history is noisy or missing. A new OccSTeP benchmark with four disturbance regimes (Reverse, Discontinuous, Fragmentary, Reductive) evaluates persistence, robustness, and action-conditioned rollout; on it, OccSTeP-WM improves over prior methods in both semantic mIoU and occupancy IoU. The work also provides ablations, implementation details, and a planned open-source release to support robust, online world modeling for planning in dynamic driving environments.

Abstract

Autonomous driving requires a persistent understanding of 3D scenes that is robust to temporal disturbances and accounts for potential future actions. We introduce a new concept of 4D Occupancy Spatio-Temporal Persistence (OccSTeP), which aims to address two tasks: (1) reactive forecasting: "what will happen next" and (2) proactive forecasting: "what would happen given a specific future action". For the first time, we create a new OccSTeP benchmark with challenging scenarios (e.g., erroneous semantic labels and dropped frames). To address this task, we propose OccSTeP-WM, a tokenizer-free world model that maintains a dense voxel-based scene state and incrementally fuses spatio-temporal context over time. OccSTeP-WM leverages a linear-complexity attention backbone and a recurrent state-space module to capture long-range spatial dependencies while continually updating the scene memory with ego-motion compensation. This design enables online inference and robust performance even when historical sensor input is missing or noisy. Extensive experiments demonstrate the effectiveness of the OccSTeP concept and our OccSTeP-WM, yielding an average semantic mIoU of 23.70% (+6.56% gain) and occupancy IoU of 35.89% (+9.26% gain). The data and code will be open-sourced at https://github.com/FaterYU/OccSTeP.
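
The two headline metrics in the abstract, occupancy IoU and semantic mIoU, can be illustrated with a minimal sketch. It assumes voxel labels are flattened into equal-length integer sequences and that one label marks empty space (the hypothetical `free_label` parameter, defaulting to 0 here). The function names are illustrative, and the benchmark's actual label convention and evaluation code may differ.

```python
from typing import Iterable, Sequence


def occupancy_iou(pred: Sequence[int], gt: Sequence[int], free_label: int = 0) -> float:
    """Binary IoU between occupied voxels, ignoring semantic classes.

    Assumption: `free_label` marks empty space; every other label counts as occupied.
    """
    inter = 0
    union = 0
    for p, g in zip(pred, gt):
        p_occ = p != free_label
        g_occ = g != free_label
        inter += int(p_occ and g_occ)
        union += int(p_occ or g_occ)
    return inter / union if union else float("nan")


def semantic_miou(pred: Sequence[int], gt: Sequence[int], classes: Iterable[int]) -> float:
    """Mean of per-class IoU over the given semantic classes.

    Classes absent from both prediction and ground truth are skipped
    (an assumption; some protocols count them differently).
    """
    ious = []
    for c in classes:
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Toy example: six voxels, label 0 = free, labels 1 and 2 = semantic classes.
pred = [0, 1, 1, 2, 0, 2]
gt = [0, 1, 2, 2, 1, 0]
iou = occupancy_iou(pred, gt)             # 3 shared occupied voxels out of 5 in the union
miou = semantic_miou(pred, gt, [1, 2])    # each class has IoU 1/3
```

Separating the two metrics makes the distinction explicit: occupancy IoU rewards getting the geometry right, while semantic mIoU additionally requires the correct class per voxel.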

Paper Structure

This paper contains 28 sections, 29 equations, 5 figures, 5 tables, 6 algorithms.

Figures (5)

  • Figure 1: Left: Overview of the 4D Occupancy Spatio-Temporal Persistence (OccSTeP) pipeline. For the first time, four challenging driving scenarios {Reverse, Discontinuous, Fragmentary, Reductive} are introduced for benchmarking two tasks: (1) reactive forecasting, "what will happen next"; (2) proactive forecasting, "what would happen given a specific future action (e.g., turn left)". Right: The comparison results show that our OccSTeP-WM achieves more robust performance.
  • Figure 2: The proposed OccSTeP-WM framework (\ref{sec:occstep}). Left: The pipeline updates incrementally, maintaining a state that summarizes the historical input. Right: Depending on its input, the main module ("step") performs either reactive (\ref{alg:reactive-occstep}) or proactive (\ref{alg:proactive-occstep}) forecasting. Between successive "step" calls, an SE(3) warp is applied (\ref{sec:istpf}). Morton order (\ref{eq:pi}) is used to preserve locality.
  • Figure 3: Visualization of the OccSTeP benchmark. The black rectangular body at the center of the occupancy grid represents the ego car.
  • Figure 4: Visualization of the occupancy world model. Method* denotes the proactive pipeline.
  • Figure 5: Visualization of OccSTeP benchmark.
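
The caption of Figure 2 mentions Morton order as a way to preserve locality when voxels are serialized. A minimal sketch of the standard 3D Morton (Z-order) bit interleaving follows. The function names, the axis order of the interleave, and the 10-bit coordinate width are assumptions for illustration, not the paper's definition of its permutation.

```python
from typing import List, Tuple

Voxel = Tuple[int, int, int]


def morton3d(x: int, y: int, z: int, bits: int = 10) -> int:
    """Interleave the low `bits` bits of x, y, z into one Z-order code.

    Bit i of x lands at position 3*i, bit i of y at 3*i + 1,
    and bit i of z at 3*i + 2 (an assumed axis order).
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def morton_sort(voxels: List[Voxel]) -> List[Voxel]:
    """Return voxels sorted by Morton code, so spatial neighbors tend to stay close in the sequence."""
    return sorted(voxels, key=lambda v: morton3d(*v))


# Toy example: four voxel coordinates in arbitrary order.
ordered = morton_sort([(1, 1, 1), (0, 0, 1), (1, 0, 0), (0, 0, 0)])
```

Compared with a plain row-major scan, this ordering keeps each 2x2x2 block of voxels contiguous in the sequence, which is the locality property a sequence model over voxels benefits from.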