
OccSim: Multi-kilometer Simulation with Long-horizon Occupancy World Models

Tianran Liu, Shengwen Zhao, Mozhgan Pourkeshavarz, Weican Li, Nicholas Rhinehart

Abstract

Data-driven autonomous driving simulation has long been constrained by its heavy reliance on pre-recorded driving logs or spatial priors, such as HD maps. This fundamental dependency severely limits scalability, restricting open-ended generation capabilities to the finite scale of existing collected datasets. To break this bottleneck, we present OccSim, the first occupancy world model-driven 3D simulator. OccSim obviates the requirement for continuous logs or HD maps; conditioned only on a single initial frame and a sequence of future ego-actions, it can stably generate over 3,000 continuous frames, enabling the continuous construction of large-scale 3D occupancy maps spanning over 4 kilometers for simulation. This represents an >80x improvement in stable generation length over previous state-of-the-art occupancy world models. OccSim is powered by two modules: a W-DiT-based static occupancy world model and the Layout Generator. W-DiT handles the ultra-long-horizon generation of static environments by explicitly introducing known rigid transformations into the architecture design, while the Layout Generator populates the dynamic foreground with reactive agents based on the synthesized road topology. With these designs, OccSim can synthesize massive, diverse simulation streams. Extensive experiments demonstrate its downstream utility: data collected directly from OccSim can pre-train 4D semantic occupancy forecasting models to achieve up to 67% zero-shot performance on unseen data, outperforming previous asset-based simulators by 11%. When scaling the OccSim dataset to 5x its size, the zero-shot performance increases to about 74%, while the improvement over asset-based simulators expands to 22.1%.


Paper Structure

This paper contains 30 sections, 6 equations, 9 figures, 3 tables, 5 algorithms.

Figures (9)

  • Figure 1: Comparison of OccSim with previous methods, and its workflow. OccSim overcomes all of the aforementioned drawbacks of previous methods: given only a single initial frame and future actions, the proposed W-DiT builds a multi-kilometer consistent occupancy map for autonomous driving simulation. Agent initial poses are generated by the layout generator and then forward-simulated. OccSim is compatible with multi-agent forward simulation methods as plug-and-play modules; in our experiments, we use an IDM variant [kesting2010enhanced].
  • Figure 2: Illustration of the structure that generates the road map at $t+1$ from the condition at timestep $t$. Here, $t$ is the sequence frame index and $\tau$ is the probability-flow timestep. Unlike classic temporal concatenation at the input, the core insight of this design is to transform temporal generation into scene completion at a single time point. This token-wise scale-and-shift condition injection allows us to precisely control spatial rigid transformations during long-horizon generation. $I(\cdot)$ denotes the rasterization of the given future trajectory $J_{t+1}^{t+n}$. The latent representations $z_t$ and $z_{t+1}$ ($\in \mathbb{R}^{H \times W \times C_z}$) at the bottom right corner are visualized as decoded BEV occupancy maps for intuitive illustration.
  • Figure 3: Agent initialization within generated static maps. Each panel displays a 200x200 voxel crop centered on the (unrendered) ego-vehicle from $\mathcal{M}_{\text{global}}$. The first two panels show distinct generated scenes; the third and fourth panels demonstrate the model's multimodality: two different plausible layouts generated for the same revisited location. Orange vehicles (second panel) are stationary.
  • Figure 4: Realism evaluation of generated 3D static occupancy. The left three columns report the conditional fidelity (MMD and KID) under a 30-frame constraint. The right two columns showcase the unconditional realism (KID and FID) over a 1000-frame long-horizon rollout. Our W-DiT-based method obtains the best long-horizon stability and conditional realism.
  • Figure 5: Long-horizon 2D realism under varying trajectories. FID and KID comparisons for 2D projections of the 3D generated occupancy. We evaluate three ego-actions: straight, closed-loop, and continuous turning. The dashed reference line indicates the "chaos" threshold characterizing scene collapse. In all scenarios, our model (red) remains strictly below this threshold, vastly outperforming the previous SOTA model (which exceeds the "chaos" threshold by the 35th/40th generated frame) and ensuring long-term structural integrity regardless of the driving action.
  • ...and 4 more figures
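Figure 1 mentions that dynamic agents are forward-simulated with an IDM variant. As a point of reference, the standard Intelligent Driver Model computes each follower's acceleration from its speed, the leader's speed, and the gap; the sketch below is a minimal textbook implementation, not OccSim's code, and the parameter defaults are illustrative values rather than the paper's settings.

```python
import math

def idm_accel(v, v_lead, gap,
              v0=30.0, T=1.5, a_max=1.0, b=1.5, s0=2.0, delta=4):
    """Standard IDM acceleration (illustrative defaults, not the paper's).

    v: follower speed [m/s], v_lead: leader speed [m/s],
    gap: bumper-to-bumper distance to the leader [m].
    """
    dv = v - v_lead  # closing speed
    # Desired dynamic gap: jam distance + time headway + braking term
    s_star = s0 + v * T + (v * dv) / (2 * math.sqrt(a_max * b))
    return a_max * (1 - (v / v0) ** delta - (s_star / gap) ** 2)

def step(v, v_lead, gap, dt=0.1):
    """One explicit-Euler forward-simulation step for a single follower."""
    a = idm_accel(v, v_lead, gap)
    v_new = max(0.0, v + a * dt)       # speeds stay non-negative
    gap_new = gap + (v_lead - v_new) * dt
    return v_new, gap_new
```

With a large gap and free road the model accelerates toward the desired speed; with a stopped leader at close range it brakes hard, which is the reactive behavior the layout-generated agents need.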
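Figure 2 describes injecting the condition via token-wise scale and shift rather than temporal concatenation at the input. A minimal numpy sketch of that conditioning pattern (AdaLN-style modulation, but with a separate scale/shift pair per token instead of one global pair) is shown below; the function name and shapes are assumptions for illustration, not OccSim's implementation.

```python
import numpy as np

def tokenwise_modulate(x, scale, shift, eps=1e-5):
    """Token-wise scale-and-shift injection (illustrative sketch).

    x:            (N, C) token features.
    scale, shift: (N, C) per-token conditioning signals, e.g. derived from
                  a rasterized future trajectory, so that each spatial token
                  can receive its own rigid-transformation information.
    Each token is layer-normalized over channels, then modulated by its OWN
    (scale, shift), unlike the single global pair in a vanilla DiT block.
    """
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mu) / np.sqrt(var + eps)
    return x_norm * (1 + scale) + shift
```

With zero scale and shift this reduces to plain layer normalization; nonzero per-token values let the condition steer individual spatial tokens, which is what makes precise control of rigid transformations possible during long-horizon rollout.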