Self-Supervised JEPA-based World Models for LiDAR Occupancy Completion and Forecasting

Haoran Zhu, Anna Choromanska

TL;DR

Self-supervised world modeling for autonomous driving remains challenging due to the cost of pixel-level generation and the risk of latent representation collapse. This work introduces AD-LiST-JEPA, a two-phase JEPA-based framework that learns latent spatiotemporal representations from multi-frame LiDAR data and evaluates them via downstream occupancy completion and forecasting (OCF). It introduces a group BEV-guided masking strategy to separate ego-motion from scene content and demonstrates that a pretrained encoder improves OCF performance on Waymo, with SIGReg regularization further boosting results. The findings support scalable latent LiDAR world models and point to future work with larger models and datasets.

Abstract

Autonomous driving, as an agent operating in the physical world, requires the fundamental capability to build world models that capture how the environment evolves spatiotemporally in order to support long-term planning. At the same time, scalability demands learning such models in a self-supervised manner; the joint-embedding predictive architecture (JEPA) enables learning world models by leveraging large volumes of unlabeled data without relying on expensive human annotations. In this paper, we propose AD-LiST-JEPA, a self-supervised world model for autonomous driving that predicts future spatiotemporal evolution from LiDAR data using a JEPA framework. We evaluate the quality of the learned representations through a downstream LiDAR-based occupancy completion and forecasting (OCF) task, which jointly assesses perception and prediction. Proof-of-concept experiments show better OCF performance with a pretrained encoder after JEPA-based world model learning.
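
To make the idea of predicting in latent space while guarding against representation collapse concrete, here is a minimal NumPy sketch of a JEPA-style objective with a variance regularizer. It is not the paper's implementation: the function name `jepa_loss`, the mean-squared prediction term, the unit-standard-deviation hinge (borrowed from VICReg-style regularization), and the parameters `var_weight` and `eps` are all assumptions chosen for illustration.

```python
import numpy as np

def jepa_loss(pred, target, var_weight=1.0, eps=1e-4):
    """Hypothetical JEPA-style loss: latent prediction error plus a variance hinge.

    pred, target: arrays of shape (batch, dim). The target embedding is treated
    as a constant here; in a training framework it would sit behind a
    stop-gradient. All names and defaults are illustrative assumptions.
    """
    # Prediction term: how far the predicted latent is from the target latent.
    pred_loss = np.mean((pred - target) ** 2)
    # Variance term: penalize any embedding dimension whose standard deviation
    # across the batch drops below 1, which discourages collapse to a constant.
    std = np.sqrt(pred.var(axis=0) + eps)
    var_loss = np.mean(np.maximum(0.0, 1.0 - std))
    return pred_loss + var_weight * var_loss, pred_loss, var_loss
```

With collapsed embeddings (every row identical) the prediction term can be zero while the variance term stays close to 1, which is the failure mode the regularizer is meant to expose.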

Paper Structure

This paper contains 14 sections, 1 equation, 4 figures, 1 table, and 2 algorithms.

Figures (4)

  • Figure 1: Proposed group BEV-guided masking.
  • Figure 2: Overview of the Phase 1 framework with variance regularization (left) and SIGReg regularization (right) variants (best viewed when zoomed in).
  • Figure 3: Raw past (including current) point cloud visualization as the neural network input from time -5 to 0. The color represents height relative to the ground.
  • Figure 4: Processed future (including current) completed occupancy visualization as the neural network prediction target from time 0 to 5. The color represents height relative to the ground.
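
Figures 3 and 4 contrast raw point clouds (input) with occupancy grids (prediction target). As a rough illustration of how a point cloud maps to an occupancy grid, the sketch below voxelizes a single frame. It is a simplification under stated assumptions: the function name `voxelize`, the grid bounds `lo` and `hi`, and the voxel size are hypothetical, and the completed occupancy targets in the paper involve processing that is not described in this section.

```python
import numpy as np

def voxelize(points, lo, hi, voxel):
    """Turn an (N, 3) point cloud into a boolean occupancy grid.

    lo, hi: hypothetical (x, y, z) bounds of the grid; voxel: edge length.
    A cell is marked occupied if at least one point falls inside it.
    """
    points = np.asarray(points, dtype=float)
    lo = np.asarray(lo, dtype=float)
    hi = np.asarray(hi, dtype=float)
    shape = np.ceil((hi - lo) / voxel).astype(int)
    # Integer cell index of every point along each axis.
    idx = np.floor((points - lo) / voxel).astype(int)
    # Discard points that fall outside the grid bounds.
    keep = np.all((idx >= 0) & (idx < shape), axis=1)
    grid = np.zeros(shape, dtype=bool)
    grid[tuple(idx[keep].T)] = True
    return grid
```

For example, calling `voxelize(points, lo=(0, 0, 0), hi=(1, 1, 1), voxel=0.5)` yields a 2 x 2 x 2 grid, and any point outside the unit cube is ignored.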