UnO: Unsupervised Occupancy Fields for Perception and Forecasting

Ben Agro, Quinlan Sykora, Sergio Casas, Thomas Gilles, Raquel Urtasun

TL;DR

UnO tackles perception and forecasting for autonomous driving by learning an unsupervised, continuous 4D occupancy field from unlabeled LiDAR data. It encodes past scans into a BEV feature map and uses a lightweight implicit decoder to query occupancy at any space-time point, trained with self-supervision from future LiDAR via occupancy-based pseudo-labels. The learned 4D occupancy representation transfers effectively to downstream tasks: point-cloud forecasting through a lightweight depth renderer and BEV semantic occupancy forecasting via few-shot fine-tuning, achieving state-of-the-art or strong few-shot performance and improved recall on dynamic, small, or rare objects. This approach demonstrates that unsupervised world models can yield rich geometry, dynamics, and semantics, enabling scalable perception and forecasting with unlabeled data and potentially advancing safety in self-driving systems.

Abstract

Perceiving the world and forecasting its future state is a critical task for self-driving. Supervised approaches leverage annotated object labels to learn a model of the world -- traditionally with object detections and trajectory predictions, or temporal bird's-eye-view (BEV) occupancy fields. However, these annotations are expensive and typically limited to a set of predefined categories that do not cover everything we might encounter on the road. Instead, we learn to perceive and forecast a continuous 4D (spatio-temporal) occupancy field with self-supervision from LiDAR data. This unsupervised world model can be easily and effectively transferred to downstream tasks. We tackle point cloud forecasting by adding a lightweight learned renderer and achieve state-of-the-art performance in Argoverse 2, nuScenes, and KITTI. To further showcase its transferability, we fine-tune our model for BEV semantic occupancy forecasting and show that it outperforms the fully supervised state-of-the-art, especially when labeled data is scarce. Finally, when compared to prior state-of-the-art on spatio-temporal geometric occupancy prediction, our 4D world model achieves a much higher recall of objects from classes relevant to self-driving.
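As a rough illustration of what querying a continuous occupancy field at an arbitrary $(x, y, z, t)$ point can look like, the sketch below bilinearly interpolates a BEV feature grid at a continuous location and passes the result, together with $z$ and $t$, through a single linear layer and a sigmoid. This is a deliberately simplified stand-in, not the decoder used in the paper: the function names `bilinear_sample` and `query_occupancy`, the grid layout `bev[row][col]`, the cell-unit coordinates, and the one-layer head are all assumptions made for illustration.

```python
import math


def bilinear_sample(bev, x, y):
    """Interpolate a BEV grid bev[row][col] -> list of channels at continuous (x, y).

    Coordinates are in cell units (hypothetical convention) and are clamped
    to the grid bounds.
    """
    h, w = len(bev), len(bev[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    return [
        (1 - fy) * ((1 - fx) * a + fx * b) + fy * ((1 - fx) * c + fx * d)
        for a, b, c, d in zip(bev[y0][x0], bev[y0][x1], bev[y1][x0], bev[y1][x1])
    ]


def query_occupancy(bev, weights, bias, x, y, z, t):
    """Return an occupancy probability in (0, 1) for one space-time query.

    A single linear layer stands in for the learned implicit decoder
    (assumption); weights has one entry per BEV channel plus two for z and t.
    """
    feat = bilinear_sample(bev, x, y) + [z, t]
    logit = sum(w_i * f_i for w_i, f_i in zip(weights, feat)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy 2x2 grid with one feature channel per cell.
bev = [[[0.0], [1.0]], [[2.0], [3.0]]]
center_feature = bilinear_sample(bev, 0.5, 0.5)
prob = query_occupancy(bev, [0.0, 0.0, 0.0], 0.0, 0.5, 0.5, 1.0, 2.0)
```

Because the query coordinates are continuous, the same feature map can be evaluated at any resolution and at any future time, which is the property that lets one model serve several downstream tasks.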

Paper Structure (60 sections, 10 equations, 18 figures, 3 tables)


Figures (18)

  • Figure 1: We present UnO, a world model that learns to predict 3D occupancy (a) over time from unlabeled data. This model can be easily and effectively transferred to downstream tasks like point cloud forecasting (b), and bird's-eye view semantic occupancy (c).
  • Figure 2: UnO's occupancy pseudo-labels: a laser beam emitted from sensor position $s_i$ at time $t_{ij}$ returns the point $p_{ij}$, meaning that the ray segment $\mathcal{R}_{ij}^{-}$ before the return is unoccupied space and the segment $\mathcal{R}_{ij}^{+}$ within a buffer $\delta$ after the LiDAR return is occupied space.
  • Figure 3: An overview of our method, UnO. The past LiDAR is voxelized and encoded into a BEV feature map, which is used by an implicit occupancy decoder to predict occupancy $\hat{\mathcal{O}}$ at query points $\mathcal{Q}$. During training, the query points and occupancy pseudo-labels are generated from future LiDAR data. At inference, the model can be queried at any $(x, y, z, t)$ point. Refer to Figure 2 for details on the query generation process.
  • Figure 4: A visualization of UnO on two different examples. We have labeled observations of note: (A) prediction of a right-turning vehicle, (B) object extent with only a partial viewpoint from the LiDAR data, (C) prediction of a moving vehicle, where spreading occupancy represents uncertainty in future acceleration, (D) prediction of walking pedestrians on the sidewalk, (E) prediction of a vehicle lane-changing around a parked car, (F) persistent point cloud predictions on the lane-changing vehicle, (G) perceiving small objects like cones.
  • Figure 5: BEV semantic occupancy results. Fine-tuning UnO vs. SOTA supervised methods across different scales of supervision.
  • ...and 13 more figures
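
The pseudo-label construction described in the Figure 2 caption can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the authors' implementation: it samples query points uniformly along one LiDAR ray, labeling points between the sensor and the return as unoccupied (0) and points within a buffer $\delta$ past the return as occupied (1). The function name `sample_ray_pseudo_labels`, the uniform sampling, the default value of `delta`, and the sample counts are all hypothetical choices.

```python
import random


def sample_ray_pseudo_labels(sensor, point, t, delta=0.1, n_free=4, n_occ=2, rng=None):
    """Return [((x, y, z, t), label), ...] for a single LiDAR ray.

    label 0: query lies on the segment from the sensor to the return (R^-).
    label 1: query lies within `delta` metres beyond the return (R^+).
    `delta`, the counts, and uniform sampling are illustrative assumptions.
    """
    rng = rng or random.Random(0)
    direction = [p - s for p, s in zip(point, sensor)]
    length = sum(c * c for c in direction) ** 0.5
    if length == 0.0:
        raise ValueError("sensor and return point coincide")
    unit = [c / length for c in direction]

    def at(r):
        # Point at distance r from the sensor along the ray, tagged with time t.
        return tuple(s + r * u for s, u in zip(sensor, unit)) + (t,)

    samples = [(at(rng.uniform(0.0, length)), 0) for _ in range(n_free)]
    samples += [(at(rng.uniform(length, length + delta)), 1) for _ in range(n_occ)]
    return samples


# One ray from the origin to a return 5 m away, observed at t = 0.5 s.
samples = sample_ray_pseudo_labels((0.0, 0.0, 0.0), (3.0, 4.0, 0.0), t=0.5,
                                   delta=0.1, n_free=5, n_occ=3)
```

Each returned pair is a space-time query and its binary target, which is the form of supervision a decoder queried at $(x, y, z, t)$ points would need during training.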