GaussianWorld: Gaussian World Model for Streaming 3D Occupancy Prediction

Sicheng Zuo, Wenzhao Zheng, Yuanhui Huang, Jie Zhou, Jiwen Lu

TL;DR

Problem: vision-based 3D occupancy prediction benefits from temporal context, but most prior methods simply fuse representations from past frames and ignore scene continuity. Approach: GaussianWorld uses a 3D Gaussian scene representation and a world model that forecasts 4D occupancy conditioned on the current RGB observation, explicitly modeling ego-motion alignment, dynamic object motion, and completion of newly observed areas via evolution and refinement layers. Contributions: a unified evolution layer, a streaming training regime with progressive sequence lengths and probabilistic frame dropout, and state-of-the-art results on nuScenes with minimal overhead. Significance: enables efficient, accurate streaming 3D perception for autonomous driving with explicit, interpretable scene evolution.
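The streaming training regime mentioned above can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' training code: the function names, the `(start_epoch, length)` schedule format, and the reading of "frame dropout" as "discard the historical state with some probability and fall back to single-frame prediction" are all hypothetical.

```python
import random


def sequence_length(epoch, schedule):
    """Pick the clip length for `epoch`.

    `schedule` is a hypothetical list of (start_epoch, length) pairs,
    sorted by start_epoch, so that clips get longer as training proceeds.
    """
    length = schedule[0][1]
    for start, n in schedule:
        if epoch >= start:
            length = n
    return length


def stream_plan(num_frames, drop_prob, rng):
    """Decide, per frame, whether the model reuses history.

    True  -> continue from the previous frame's Gaussians (streaming).
    False -> drop the history and predict from the current frame only.
    The first frame never has history. `drop_prob` is an assumed knob.
    """
    plan = []
    for t in range(num_frames):
        use_history = t > 0 and rng.random() >= drop_prob
        plan.append(use_history)
    return plan


# Example: short clips early in training, longer clips later.
schedule = [(0, 1), (2, 2), (4, 4)]
rng = random.Random(0)
for epoch in range(6):
    n = sequence_length(epoch, schedule)
    print(epoch, n, stream_plan(n, drop_prob=0.2, rng=rng))
```

Randomly dropping the history in this way would keep the model able to start a stream from scratch while it learns to exploit longer sequences.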

Abstract

3D occupancy prediction is important for autonomous driving due to its comprehensive perception of the surroundings. To incorporate sequential inputs, most existing methods fuse representations from previous frames to infer the current 3D occupancy. However, they fail to consider the continuity of driving scenarios and ignore the strong prior provided by the evolution of 3D scenes (e.g., only dynamic objects move). In this paper, we propose a world-model-based framework to exploit the scene evolution for perception. We reformulate 3D occupancy prediction as a 4D occupancy forecasting problem conditioned on the current sensor input. We decompose the scene evolution into three factors: 1) ego motion alignment of static scenes; 2) local movements of dynamic objects; and 3) completion of newly-observed scenes. We then employ a Gaussian world model (GaussianWorld) to explicitly exploit these priors and infer the scene evolution in the 3D Gaussian space considering the current RGB observation. We evaluate the effectiveness of our framework on the widely used nuScenes dataset. Our GaussianWorld improves the performance of the single-frame counterpart by over 2% in mIoU without introducing additional computations. Code: https://github.com/zuosc19/GaussianWorld.
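The first and third factors (ego-motion alignment and completion) can be illustrated with a short sketch. It is a minimal illustration under assumptions, not the released implementation: Gaussian means are plain `(x, y, z)` tuples in the previous ego frame, ego poses are 4x4 ego-to-world rigid transforms given as nested lists, the perception range is a hypothetical axis-aligned box, and new Gaussians are sampled uniformly over the whole box rather than only in the newly observed region.

```python
import random


def invert_rigid(T):
    """Invert a 4x4 rigid transform [R | t] stored as nested lists."""
    R = [row[:3] for row in T[:3]]
    t = [row[3] for row in T[:3]]
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]  # R transposed
    t_inv = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [Rt[i] + [t_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def matmul4(A, B):
    """Multiply two 4x4 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transform_point(T, p):
    """Apply a 4x4 rigid transform to a 3D point."""
    x, y, z = p
    return tuple(T[i][0] * x + T[i][1] * y + T[i][2] * z + T[i][3]
                 for i in range(3))


def align_and_complete(means_prev, pose_prev, pose_curr, lo, hi, rng):
    """Move historical means into the current ego frame, then refill.

    Returns (kept, new): aligned historical means still inside the box
    [lo, hi], and randomly placed means that replace the ones that left.
    """
    # previous ego frame -> world -> current ego frame
    T = matmul4(invert_rigid(pose_curr), pose_prev)
    moved = [transform_point(T, p) for p in means_prev]
    kept = [p for p in moved if all(lo[i] <= p[i] <= hi[i] for i in range(3))]
    n_new = len(means_prev) - len(kept)
    new = [tuple(rng.uniform(lo[i], hi[i]) for i in range(3))
           for _ in range(n_new)]
    return kept, new


# Example: the ego vehicle drives 2 m forward along x between two frames.
pose_prev = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
             [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
pose_curr = [[1.0, 0.0, 0.0, 2.0], [0.0, 1.0, 0.0, 0.0],
             [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
kept, new = align_and_complete(
    [(1.0, 0.0, 0.0), (-4.0, 0.0, 0.0)], pose_prev, pose_curr,
    lo=(-5.0, -5.0, -1.0), hi=(5.0, 5.0, 1.0), rng=random.Random(0))
```

In this example the first mean ends up 1 m behind the vehicle and is kept, while the second falls outside the box and is replaced by one random Gaussian. The second factor, local motion of dynamic objects, is what the learned layers would then predict on top of this alignment.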

Paper Structure

This paper contains 15 sections, 13 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: While single-frame 3D occupancy prediction methods demonstrate strong performance (OccFormer, SurroundOcc), the incorporation of temporal information can further improve the results (CVT-Occ). However, most existing methods fuse past scene representations (BEVFormer, FB-OCC) to infer the current 3D occupancy, which ignores the continuity of driving scenarios and introduces additional computations. In contrast, we propose a world-model-based framework for streaming 3D occupancy prediction and explicitly model the scene evolution using the current camera observations as inputs. Our framework improves the performance of existing methods without additional computation overhead.
  • Figure 2: Framework of our GaussianWorld for streaming 3D semantic occupancy prediction. As the ego vehicle moves from the last frame to the current frame, we first align historical Gaussians to the current time and complete newly-observed areas with random Gaussians. We then utilize several Gaussian world layers composed of self-encoding, cross-attention, and unified refinement blocks to simultaneously predict the evolution of historical Gaussians and the properties of completed Gaussians. The refined Gaussians model the scene evolution and generate the current occupancy.
  • Figure 3: Illustration of the three decomposed factors of scene evolution. We decompose the scene evolution into three key factors: ego-motion alignment of static scenes, local movements of dynamic objects, and the completion of newly-observed areas.
  • Figure 4: Illustration of the proposed unified refinement block. We employ the perception mode to update all attributes of newly completed Gaussians. We employ the motion mode to predict the evolution of historical Gaussians, where only positions of dynamic Gaussians are updated in the evolution layer $\mathrm{Evol}$ and all attributes of historical Gaussians are updated in the refinement layer $\mathrm{Refine}$.
  • Figure 5: Performance of streaming occupancy prediction with different sequence lengths. We also show the performance of using different numbers of refinement layers.
  • ...and 2 more figures
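
The update rules in the Figure 4 caption can be written down as a small dispatch function. The sketch below is only an illustration of those rules: the `Gaussian` and `Delta` types, the `is_new` and `is_dynamic` flags, the reduced attribute set (mean, scale, opacity), and the additive form of the update are assumptions, and the deltas would in practice come from a learned network.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Gaussian:
    mean: tuple       # (x, y, z) position
    scale: tuple      # per-axis extent
    opacity: float
    is_new: bool      # completed in the current frame (hypothetical flag)
    is_dynamic: bool  # belongs to a movable object (hypothetical flag)


@dataclass(frozen=True)
class Delta:
    """Predicted attribute offsets; a stand-in for a network output."""
    d_mean: tuple
    d_scale: tuple
    d_opacity: float


def _add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def _update_all(g, d):
    return replace(g, mean=_add(g.mean, d.d_mean),
                   scale=_add(g.scale, d.d_scale),
                   opacity=g.opacity + d.d_opacity)


def unified_refine(g, d, layer):
    """Apply one layer ("evol" or "refine") to one Gaussian."""
    if layer not in ("evol", "refine"):
        raise ValueError(f"unknown layer: {layer}")
    if g.is_new:
        # Perception mode: all attributes of new Gaussians are updated.
        return _update_all(g, d)
    if layer == "refine":
        # Motion mode, refinement layer: all attributes are updated.
        return _update_all(g, d)
    if g.is_dynamic:
        # Motion mode, evolution layer: only dynamic positions move.
        return replace(g, mean=_add(g.mean, d.d_mean))
    return g  # static historical Gaussians stay fixed in the evolution layer


# Example: a static historical Gaussian is untouched by an evolution layer.
wall = Gaussian((1.0, 0.0, 0.0), (0.5, 0.5, 0.5), 0.5, False, False)
step = Delta((0.5, 0.0, 0.0), (0.25, 0.25, 0.25), 0.25)
assert unified_refine(wall, step, "evol") == wall
```

Sharing one block across both modes is what lets the same layers handle historical and newly completed Gaussians in a single pass.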