OccTENS: 3D Occupancy World Model via Temporal Next-Scale Prediction

Bu Jin, Songen Gu, Xiaotao Hu, Yupeng Zheng, Xiaoyang Guo, Qian Zhang, Xiaoxiao Long, Wei Yin

Abstract

In this paper, we propose OccTENS, a generative occupancy world model that enables controllable, high-fidelity long-term occupancy generation while maintaining computational efficiency. Unlike visual generation, an occupancy world model must capture the fine-grained 3D geometry and dynamic evolution of 3D scenes, which poses great challenges for generative models. Recent autoregressive (AR) approaches have demonstrated the potential to predict vehicle movement and future occupancy scenes simultaneously from historical observations, but they typically suffer from inefficiency, temporal degradation in long-term generation, and a lack of controllability. To address these issues holistically, we reformulate the occupancy world model as a temporal next-scale prediction (TENS) task, which decomposes temporal sequence modeling into spatial scale-by-scale generation and temporal scene-by-scene prediction. With a TensFormer, OccTENS can effectively manage the temporal causality and spatial relationships of occupancy sequences in a flexible and scalable way. To enhance pose controllability, we further propose a holistic pose aggregation strategy, which features unified sequence modeling for occupancy and ego-motion. Experiments show that OccTENS outperforms the state-of-the-art method with both higher occupancy quality and faster inference.

Paper Structure

This paper contains 20 sections, 6 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: We propose OccTENS, a coarse-to-fine occupancy world model that enables controllable, high-fidelity long-term occupancy generation while maintaining computational efficiency.
  • Figure 2: Overview of OccTENS. OccTENS consists of two components: two tokenizers (a and c) that encode 3D occupancy and ego motion into discrete tokens, and a generative world model (b) using temporal next-scale prediction for future 3D scene forecasting.
  • Figure 3: Temporal Next-Scale Prediction. The proposed TensFormer decomposes sequential occupancy generation into two distinct components: scene-by-scene prediction and scale-by-scale generation. $\dot{F}$, $\ddot{F}$, and $\hat{F}$ denote the intermediate representation after the scale-wise temporal causal layer, the intermediate representation after the frame-wise spatial layer, and the predicted logits for the next frame, respectively.
  • Figure 4: Qualitative results for long-term generation. We compare OccTENS with OccWorld [zheng2023occworld] in generating long sequences. OccWorld exhibits repetition artifacts. In contrast, OccTENS produces more diverse and realistic occupancy scenes. We mark the ego vehicle with an orange circle in the first column.
  • Figure 5: Qualitative results of pose controllability. OccTENS successfully generates results aligned with the pose input, such as turning (top) or changing lanes (bottom), indicating the superior controllability of OccTENS.
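
The decomposition named in the abstract and in the Figure 3 caption (a scale-wise temporal causal layer followed by a frame-wise spatial layer) can be illustrated with two attention masks. This is a minimal sketch and not the authors' implementation. It assumes tokens are laid out frame by frame and, within each frame, scale by scale from coarse to fine. It further assumes that the temporal layer lets a token attend only to same-scale tokens in the current and earlier frames, and that the spatial layer lets a token attend only to tokens of its own frame at the same or coarser scales. The function names, the scale sizes, and both masking rules are hypothetical.

```python
import numpy as np

def token_index(num_frames, tokens_per_scale):
    """Frame id and scale id of every token (frame-major, coarse-to-fine layout)."""
    # Scale ids for a single frame, e.g. [0, 1, 1, 1, 1, 2, ...].
    scale_ids_one_frame = np.repeat(np.arange(len(tokens_per_scale)), tokens_per_scale)
    frame_ids = np.repeat(np.arange(num_frames), scale_ids_one_frame.size)
    scale_ids = np.tile(scale_ids_one_frame, num_frames)
    return frame_ids, scale_ids

def temporal_causal_mask(frame_ids, scale_ids):
    """mask[q, k] is True if query q may attend to key k in the temporal layer.

    Assumed rule: same scale, and the key frame is not later than the query frame.
    """
    same_scale = scale_ids[:, None] == scale_ids[None, :]
    not_future = frame_ids[None, :] <= frame_ids[:, None]
    return same_scale & not_future

def spatial_mask(frame_ids, scale_ids):
    """mask[q, k] is True if query q may attend to key k in the spatial layer.

    Assumed rule: same frame, and the key scale is not finer than the query scale.
    """
    same_frame = frame_ids[:, None] == frame_ids[None, :]
    coarser_or_equal = scale_ids[None, :] <= scale_ids[:, None]
    return same_frame & coarser_or_equal

# Hypothetical sizes: 3 frames, each with scales of 1, 4 and 9 tokens.
frame_ids, scale_ids = token_index(num_frames=3, tokens_per_scale=[1, 4, 9])
temporal = temporal_causal_mask(frame_ids, scale_ids)
spatial = spatial_mask(frame_ids, scale_ids)
```

Splitting attention this way means no layer attends over the full sequence of all frames and all scales at once, which is one plausible reading of how a factorized design could keep long-horizon generation efficient.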