SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model

Jiayuan Du, Yiming Zhao, Zhenglong Guo, Yong Pan, Wenbo Hou, Zhihui Hao, Kun Zhan, Qijun Chen

TL;DR

This work tackles 4D occupancy forecasting for autonomous driving by eliminating discrete tokenization and BEV priors. It introduces SparseWorld-TC, a trajectory-conditioned end-to-end transformer that represents scenes with sparse anchors and uses deformable attention to fuse multi-view sensor data, past context, and future trajectory information. Through random horizon training and Chamfer/semantic losses, the model achieves state-of-the-art 1–3s occupancy forecasting on Occ3D-nuScenes and strong long-term results, while remaining robust to different trajectory conditions. The approach demonstrates scalable, trajectory-guided predictions with potential extensions to alternative representations like 3D Gaussians and direct sensor prediction. This advances practical, end-to-end 4D occupancy forecasting for planning and safety assessment in dynamic environments.
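The Chamfer loss mentioned above compares a predicted point set with a ground-truth point set without requiring a one-to-one correspondence. The sketch below shows one common symmetric form, using squared distances averaged over both directions. The summary does not give the paper's exact variant (squared versus unsquared distances, weighting, per-class handling), so the function name `chamfer_distance` and this formulation are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between point sets of shape (N, 3) and (M, 3).

    Assumed form: mean squared nearest-neighbour distance, summed over
    both directions. The paper may use a different variant.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # Pairwise squared Euclidean distances, shape (N, M).
    diff = pred[:, None, :] - target[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    # For each predicted point, the closest target point, and vice versa.
    pred_to_target = d2.min(axis=1).mean()
    target_to_pred = d2.min(axis=0).mean()
    return pred_to_target + target_to_pred


# Identical sets give zero; a single pair one unit apart gives 1 + 1 = 2.
same = chamfer_distance([[0, 0, 0], [1, 0, 0]], [[0, 0, 0], [1, 0, 0]])
shifted = chamfer_distance([[0, 0, 0]], [[1, 0, 0]])
```

Because every point is matched to its nearest neighbour in the other set, a loss of this kind suits sparse predictions where the number and ordering of predicted points need not match the ground truth.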

Abstract

This paper introduces a novel architecture for trajectory-conditioned forecasting of future 3D scene occupancy. In contrast to methods that rely on variational autoencoders (VAEs) to generate discrete occupancy tokens, which inherently limits representational capacity, our approach predicts multi-frame future occupancy end-to-end, directly from raw image features. Inspired by the success of attention-based transformer architectures in foundational vision and language models such as GPT and VGGT, we employ a sparse occupancy representation that bypasses the intermediate bird's-eye-view (BEV) projection and its explicit geometric priors. This design allows the transformer to capture spatiotemporal dependencies more effectively. By avoiding both the finite-capacity constraint of discrete tokenization and the structural limitations of BEV representations, our method achieves state-of-the-art performance on the nuScenes benchmark for 1–3 second occupancy forecasting, outperforming existing approaches by a significant margin. Furthermore, it demonstrates robust understanding of scene dynamics, consistently delivering high accuracy under arbitrary future trajectory conditioning.

Paper Structure

This paper contains 22 sections, 6 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Without a VAE codebook or BEV representation, the model represents future multi-frame occupancy using sparse queries and leverages image evidence from prior frames to convert randomly sampled 3D points into reliable future occupancy.
  • Figure 2: Occupancy is modeled as a collection of anchors, each formed by a bundle of 3D points paired with an occupancy embedding. The embedding is decoded by two MLPs to yield per-point offsets and semantic class logits, which denoise the sampled points into a consistent occupancy field.
  • Figure 3: We embed the sensor observations, occupancy priors, and trajectories into feature vectors. These embeddings pass through stacked frame-level attention and temporal attention blocks that iteratively fuse cross-modal and cross-time information into occupancy features. Those occupancy features finally decode random points into meaningful occupancy predictions.
  • Figure 4: Qualitative results of SparseWorld-TC. The method captures both dynamic and static changes in the surrounding environment over a 3-second forecasting horizon.
  • Figure 5: Long-term 4D occupancy world model forecasting.
  • ...and 6 more figures
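
The anchor decoding described in the Figure 2 caption can be sketched roughly as follows: each anchor holds a bundle of sampled 3D points plus one embedding, and two small MLP heads turn the embedding into per-point offsets and class logits. All sizes (`A`, `K`, `D`, `C`), the two-layer ReLU heads, the random weights, and the choice of per-point (rather than per-anchor) logits are assumptions made for illustration; the caption does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: anchors, points per anchor, embedding dim, classes.
A, K, D, C = 4, 8, 32, 5


def init_mlp(d_in, d_hidden, d_out):
    """Random weights for a two-layer perceptron (stand-in for trained heads)."""
    return (rng.normal(0.0, 0.02, (d_in, d_hidden)), np.zeros(d_hidden),
            rng.normal(0.0, 0.02, (d_hidden, d_out)), np.zeros(d_out))


def mlp(x, w1, b1, w2, b2):
    """Two-layer perceptron with a ReLU in between."""
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2 + b2


offset_head = init_mlp(D, 64, K * 3)  # embedding -> one xyz offset per point
class_head = init_mlp(D, 64, K * C)   # embedding -> class logits per point


def decode_anchors(points, embedding):
    """Shift sampled points by predicted offsets and return semantic logits."""
    a, k, _ = points.shape
    offsets = mlp(embedding, *offset_head).reshape(a, k, 3)
    logits = mlp(embedding, *class_head).reshape(a, k, -1)
    return points + offsets, logits


points = rng.uniform(-1.0, 1.0, (A, K, 3))  # randomly sampled 3D points
embedding = rng.normal(size=(A, D))         # one occupancy embedding per anchor

refined, logits = decode_anchors(points, embedding)
labels = logits.argmax(axis=-1)             # predicted class index per point
```

In the model described by the captions, the embeddings would come from the stacked frame-level and temporal attention blocks rather than from random noise, and the heads would be trained with the Chamfer and semantic losses.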