RenderWorld: World Model with Self-Supervised 3D Label

Ziyang Yan, Wenzhen Dong, Yihua Shao, Yuhang Lu, Liu Haiyang, Jingwen Liu, Haozhe Wang, Zhe Wang, Yan Wang, Fabio Remondino, Yuexin Ma

TL;DR

RenderWorld tackles vision-only autonomous driving by generating 3D occupancy labels from 2D multi-view images via a self-supervised Img2Occ module using Gaussian Splatting, then encoding these labels with AM-VAE for compact, fine-grained scene representations. A world model with a self-supervised tokenizer and a hierarchical transformer forecasts 4D scene evolution and supports motion planning, all while using less GPU memory than NeRF-based methods. The approach achieves state-of-the-art performance in 3D occupancy prediction, strong 4D forecasting, and competitive planning on nuScenes, demonstrating the practicality of pure-vision pipelines for perception, forecasting, and decision making. The combination of Gaussian-based occupancy, air/non-air separation, and a two-stage training regime yields improved accuracy and efficiency, highlighting its potential for real-time autonomous driving deployments.

Abstract

End-to-end autonomous driving with vision-only perception is not only more cost-effective than LiDAR-vision fusion but also more reliable than traditional methods. To achieve an economical and robust purely visual autonomous driving system, we propose RenderWorld, a vision-only end-to-end autonomous driving framework that generates 3D occupancy labels using a self-supervised Gaussian-based Img2Occ Module, encodes the labels with AM-VAE, and uses a world model for forecasting and planning. RenderWorld employs Gaussian Splatting to represent 3D scenes and render 2D images, which greatly improves segmentation accuracy and reduces GPU memory consumption compared with NeRF-based methods. By applying AM-VAE to encode air and non-air separately, RenderWorld achieves a more fine-grained representation of scene elements, leading to state-of-the-art performance in both 4D occupancy forecasting and motion planning from an autoregressive world model.
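
The air/non-air separation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it shows only the mask-based split and recombination of occupancy labels and leaves out the VAE encoders entirely. The label id `AIR = 17`, the flat-list voxel layout, and the helper names `split_air` and `merge` are all assumptions made for illustration.

```python
# Hypothetical label id for empty space; the real id depends on the dataset.
AIR = 17


def split_air(voxels):
    """Split a flat list of semantic labels into (mask, air_part, non_air_part).

    mask[i] is True where voxel i is non-air. Each part keeps the original
    length, with positions that belong to the other category set to None.
    """
    mask = [label != AIR for label in voxels]
    air_part = [None if m else label for label, m in zip(voxels, mask)]
    non_air_part = [label if m else None for label, m in zip(voxels, mask)]
    return mask, air_part, non_air_part


def merge(mask, air_part, non_air_part):
    """Recombine the two parts: take the non-air value where the mask is set."""
    return [n if m else a for m, a, n in zip(mask, air_part, non_air_part)]


# Toy "scene": six voxels, three of them empty.
voxels = [AIR, 3, AIR, 7, 7, AIR]
mask, air_part, non_air_part = split_air(voxels)
restored = merge(mask, air_part, non_air_part)
```

In a pipeline of the kind the abstract describes, each part would be passed to its own encoder before recombination. Here the two parts are merged directly to show that the split loses no information.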

Paper Structure (13 sections, 20 equations, 3 figures, 6 tables)

Figures (3)

  • Figure 1: General pipeline of RenderWorld. We first generate the 3D occupancy labels through an Img2Occ Module (Figure 2). Then, using the Air Mask Variational Autoencoder (AM-VAE), the separated air and non-air voxels are independently encoded into latent representations (i.e., discrete tokens). Finally, the world model processes these latent representations to predict voxels and trajectories, ultimately outputting the predicted occupancy and self-planning.
  • Figure 2: Training paradigm of the 2D-to-3D occupancy prediction Module. Our proposed Img2Occ Module uses 2D labels to train the 3D occupancy network, allowing the model to take advantage of detailed 2D pixel-level semantics and depth supervision.
  • Figure 3: Visualization of the forecasting and planning results of RenderWorld.