
COME: Adding Scene-Centric Forecasting Control to Occupancy World Model

Yining Shi, Kun Jiang, Qiang Meng, Ke Wang, Jiabao Wang, Wenchao Sun, Tuopu Wen, Mengmeng Yang, Diange Yang

TL;DR

COME addresses the challenge of predicting 4D occupancy for autonomous driving by disentangling ego-motion from scene evolution through a scene-centric forecasting branch. It integrates scene priors into a diffusion-based occupancy model via a ControlNet, enabling ego-invariant yet geometrically consistent predictions. The approach yields state-of-the-art results on the Occ3D-nuScenes benchmark across multiple input modalities and horizons, and its ablations validate the benefits of the scene-centric module and controlled feature injection. This work demonstrates the value of disentangled representation learning for improving spatio-temporal fidelity in occupancy forecasting and provides a controllable framework for future extensions.

Abstract

World models are critical for autonomous driving to simulate environmental dynamics and generate synthetic data. Existing methods struggle to disentangle ego-vehicle motion (perspective shifts) from scene evolution (agent interactions), leading to suboptimal predictions. We instead propose to separate environmental changes from ego-motion by leveraging scene-centric coordinate systems. In this paper, we introduce COME: a framework that integrates scene-centric forecasting Control into the Occupancy world ModEl. Specifically, COME first generates ego-irrelevant, spatially consistent future features through a scene-centric prediction branch; these features are then converted into scene conditions by a tailored ControlNet. The condition features are subsequently injected into the occupancy world model, enabling more accurate and controllable future occupancy predictions. Experimental results on the Occ3D-nuScenes dataset show that COME achieves consistent and significant improvements over state-of-the-art (SOTA) methods across diverse configurations, including different input sources (ground-truth, camera-based, and fusion-based occupancy) and prediction horizons (3 s and 8 s). For example, under the same settings, COME achieves a 26.3% higher mIoU than DOME and a 23.7% higher mIoU than UniScene. These results highlight the efficacy of disentangled representation learning in enhancing spatio-temporal prediction fidelity for world models. Code and videos will be available at https://github.com/synsin0/COME.
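
The distinction between ego-centric and scene-centric coordinates can be illustrated with a small 2D example. The sketch below is not taken from the COME code base: the function names, the `(x, y, yaw)` pose format, and the sample values are all assumptions chosen for illustration. It shows that a static object appears to move when expressed in the ego frame of a moving vehicle, but stays fixed once each observation is mapped back into a single scene frame.

```python
import math

def scene_to_ego(point_xy, ego_pose):
    """Express a scene-frame point in the ego frame (hypothetical helper).

    ego_pose = (x, y, yaw): ego position and heading in the scene frame.
    """
    px, py = point_xy
    ex, ey, yaw = ego_pose
    dx, dy = px - ex, py - ey
    c, s = math.cos(yaw), math.sin(yaw)
    # Inverse rigid transform: translate, then rotate by -yaw.
    return (c * dx + s * dy, -s * dx + c * dy)

def ego_to_scene(point_xy, ego_pose):
    """Map an ego-frame point back into the fixed scene frame."""
    px, py = point_xy
    ex, ey, yaw = ego_pose
    c, s = math.cos(yaw), math.sin(yaw)
    # Forward rigid transform: rotate by yaw, then translate.
    return (ex + c * px - s * py, ey + s * px + c * py)

# A parked car at a fixed scene location (illustrative values).
parked_car_scene = (10.0, 2.0)
# Ego poses at three time steps while the vehicle drives and turns slightly.
ego_poses = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.1), (4.0, 0.3, 0.2)]

# Ego-centric view: the static car's coordinates change at every step.
ego_views = [scene_to_ego(parked_car_scene, pose) for pose in ego_poses]
# Scene-centric view: compensating for ego motion recovers one fixed location.
recovered = [ego_to_scene(view, pose) for view, pose in zip(ego_views, ego_poses)]

for view, rec in zip(ego_views, recovered):
    print("ego frame:", view, "scene frame:", rec)
```

In this toy setting, a forecaster working in the scene frame only has to model genuine scene changes, which is the intuition behind predicting in scene-centric coordinates before conditioning an ego-centric model.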

Paper Structure

This paper contains 27 sections, 1 equation, 10 figures, 6 tables.

Figures (10)

  • Figure 1: COME with both scene-centric and ego-centric representations. Compared to ego-centric evolution, scene-centric prediction shows a smaller gap between consecutive frames over time. COME uses the scene-centric prediction results as guidance to enhance the occupancy world model.
  • Figure 2: The proposed COME framework comprises three main modules: (1) an Occupancy World Model that predicts future occupancy using historical observations and other inputs (e.g., poses, time steps, BEV layouts); (2) a Scene-centric Forecasting Module that produces spatially consistent scene predictions by eliminating the influence of ego motion; and (3) the COME ControlNet, which converts the scene conditions from the forecasting module into control features that are subsequently injected into the world model for controllable and geometrically coherent occupancy generation.
  • Figure 3: Qualitative results of 3-s 4D occupancy generation.
  • Figure 4: Visualization examples demonstrating how well COME generation aligns with pose control. For different driving commands such as Go Straight, Turn Left, and Turn Right, COME follows the pose control closely and generates scenarios similar to the ground truth.
  • Figure 5: Visualization examples of occupancy generation with BEV layouts. COME generates occupancy sequences that closely follow the BEV layout control.
  • ...and 5 more figures
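
The feature injection described for Figure 2 can be sketched in the style of the original ControlNet, which adds control features to backbone features through zero-initialized layers so that the control branch contributes nothing at the start of training. Whether COME's tailored ControlNet uses exactly this scheme is an assumption here; the class `ZeroInitControl`, the helper `inject`, and all numeric values are hypothetical and use plain Python lists rather than a deep learning framework.

```python
def linear(weights, bias, x):
    """Apply y = W x + b to a plain Python list."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

class ZeroInitControl:
    """Hypothetical control branch whose projection starts at zero."""

    def __init__(self, dim):
        # Zero weights and bias: the branch outputs zeros until it is trained.
        self.weights = [[0.0] * dim for _ in range(dim)]
        self.bias = [0.0] * dim

    def __call__(self, scene_condition):
        return linear(self.weights, self.bias, scene_condition)

def inject(backbone_feature, control_feature):
    """Additive injection of control features into a backbone feature."""
    return [h + c for h, c in zip(backbone_feature, control_feature)]

# Illustrative values standing in for a world-model feature and a scene condition.
backbone_feature = [0.5, -1.0, 2.0]
scene_condition = [1.0, 1.0, 1.0]

control = ZeroInitControl(dim=3)
# Before any training, the backbone feature passes through unchanged.
before_training = inject(backbone_feature, control(scene_condition))

# Stand-in for a gradient update that makes one weight non-zero.
control.weights[0][0] = 0.25
after_update = inject(backbone_feature, control(scene_condition))

print(before_training)
print(after_update)
```

The appeal of this design is that a pretrained world model keeps its original behavior when the control branch is first attached, and the scene condition gains influence only as the projection weights move away from zero.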