GASP: Unifying Geometric and Semantic Self-Supervised Pre-training for Autonomous Driving

William Ljungbergh, Adam Lilja, Adam Tonderski, Arvid Laveno Ling, Carl Lindström, Willem Verbeke, Junsheng Fu, Christoffer Petersson, Lars Hammarstrand, Michael Felsberg

TL;DR

GASP tackles scalable autonomous driving pre-training by learning a unified 4D occupancy representation that integrates geometric occupancy, ego-path traversal, and semantic features from a vision foundation model. By training on future lidar data, camera-derived features, and ego poses, it yields a rich, generalizable representation for downstream tasks such as semantic BEV forecasting, online mapping, and ego-trajectory prediction. Empirical results show consistent gains over a strong UnO baseline, especially under limited labeled data, and scaling experiments demonstrate data-efficient improvements across datasets. The work also introduces practical enhancements like rotation augmentation and missing-lidar-ray supervision, and provides open-source code to promote community adoption and further research.

Abstract

Self-supervised pre-training based on next-token prediction has enabled large language models to capture the underlying structure of text, and has led to unprecedented performance on a large array of tasks when applied at scale. Similarly, autonomous driving generates vast amounts of spatiotemporal data, suggesting that scale can be harnessed to learn the underlying geometric and semantic structure of the environment and its evolution over time. In this direction, we propose a geometric and semantic self-supervised pre-training method, GASP, that learns a unified representation by predicting, at any queried future point in spacetime, (1) general occupancy, capturing the evolving structure of the 3D scene; (2) ego occupancy, modeling the ego vehicle path through the environment; and (3) distilled high-level features from a vision foundation model. By modeling geometric and semantic 4D occupancy fields instead of raw sensor measurements, the model learns a structured, generalizable representation of the environment and its evolution through time. We validate GASP on multiple autonomous driving benchmarks, demonstrating significant improvements in semantic occupancy forecasting, online mapping, and ego trajectory prediction. Our results demonstrate that continuous 4D geometric and semantic occupancy prediction provides a scalable and effective pre-training paradigm for autonomous driving. For code and additional visualizations, see https://research.zenseact.com/publications/gasp/.
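
The query interface described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name ImplicitDecoder, the helper bilinear_sample, the layer sizes, the 50 m grid extent, and the choice to feed all three heads the same (x, y, z, t) query are assumptions made only for illustration, and the weights are random rather than trained. The sketch shows one shared BEV feature map being sampled at arbitrary spacetime points and decoded into occupancy, ego occupancy, and a semantic feature vector.

```python
import numpy as np


def bilinear_sample(bev, xy, extent):
    """Bilinearly interpolate a BEV map of shape (H, W, C) at metric points (N, 2)."""
    h, w, _ = bev.shape
    # Map metric coordinates in [-extent, extent] to continuous pixel coordinates.
    u = np.clip((xy[:, 0] + extent) / (2 * extent) * (w - 1), 0, w - 1)
    v = np.clip((xy[:, 1] + extent) / (2 * extent) * (h - 1), 0, h - 1)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    u1, v1 = np.minimum(u0 + 1, w - 1), np.minimum(v0 + 1, h - 1)
    du, dv = (u - u0)[:, None], (v - v0)[:, None]
    top = bev[v0, u0] * (1 - du) + bev[v0, u1] * du
    bottom = bev[v1, u0] * (1 - du) + bev[v1, u1] * du
    return top * (1 - dv) + bottom * dv


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class ImplicitDecoder:
    """Hypothetical one-hidden-layer decoder with three output heads (random weights)."""

    def __init__(self, feat_dim, sem_dim, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = feat_dim + 4  # local BEV feature concatenated with the (x, y, z, t) query
        self.w1 = rng.normal(0.0, 0.1, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w_occ = rng.normal(0.0, 0.1, (hidden, 1))        # general occupancy head
        self.w_ego = rng.normal(0.0, 0.1, (hidden, 1))        # ego occupancy head
        self.w_sem = rng.normal(0.0, 0.1, (hidden, sem_dim))  # distilled feature head

    def __call__(self, bev, queries, extent=50.0):
        local = bilinear_sample(bev, queries[:, :2], extent)
        hidden = np.maximum(0.0, np.concatenate([local, queries], axis=1) @ self.w1 + self.b1)
        occupancy = sigmoid(hidden @ self.w_occ)[:, 0]
        ego_occupancy = sigmoid(hidden @ self.w_ego)[:, 0]
        semantic = hidden @ self.w_sem
        return occupancy, ego_occupancy, semantic


# Usage: two spacetime queries (x, y, z in metres, t in seconds) against a random BEV map.
bev = np.random.default_rng(1).normal(size=(64, 64, 16))
queries = np.array([[0.0, 0.0, 0.5, 1.0], [10.0, -5.0, 1.0, 2.5]])
decoder = ImplicitDecoder(feat_dim=16, sem_dim=8)
occ, ego, sem = decoder(bev, queries)
```

Because the decoder is a function of continuous coordinates, any future point can be queried without committing to a fixed output grid. In pre-training, the targets for these outputs would come from future lidar, ego poses, and vision foundation model features, as the abstract describes; that supervision is not modeled here.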

Paper Structure

This paper contains 26 sections, 7 equations, 18 figures, 15 tables.

Figures (18)

  • Figure 1: GASP learns a structured, generalizable representation of the environment and its evolution and can be further trained to perform well on downstream AD tasks. We outperform the SotA pre-training method UnO [agro2024uno] across the board, especially on primarily semantic tasks like map segmentation. Results without pre-training are displayed for reference. Downstream tasks requiring additional labels are post-trained using 1000 samples ($\sim$1% of pre-training scale).
  • Figure 2: Overview of GASP. Past lidar scans are encoded into a BEV feature map. These features are used by implicit decoders to predict DINOv2 features $\hat{\mathcal{D}}$, occupancy $\hat{\mathcal{O}}$, and ego-path $\hat{\mathcal{E}}$ at the query points $\mathcal{Q}$ generated from future sensor data during pre-training. We also show that the learned representation is useful when transferred to an array of downstream AD tasks.
  • Figure 3: Predicted occupancy (colored by depth and height respectively) and DINOv2 features (mapped to RGB using the three most important features) projected into camera views, as well as a holistic view from slightly above and behind the ego vehicle. Different types of objects, such as road, vehicles, buildings, and trees, have different features, indicating that the model has a semantic understanding of the objects in the scene. A white box is inserted to mark the ego vehicle for clarity.
  • Figure 4: Predicted future VLM features from a Bird's Eye View. The model correctly predicts the car taking a right turn as well as those going straight through the crossing.
  • Figure 5: Semantic BEV forecasting AP (mean and std. dev.) as a function of the number of labeled training samples.
  • ...and 13 more figures
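
The Figure 3 caption mentions mapping features to RGB using the three most important features. One common way to produce such a visualization is to project the features onto their top three principal components; whether the authors do exactly this is an assumption, and the function name features_to_rgb is hypothetical. A minimal sketch:

```python
import numpy as np


def features_to_rgb(features):
    """Project (N, C) features onto their top three principal components, scaled to [0, 1]."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:3].T
    lo = projected.min(axis=0, keepdims=True)
    hi = projected.max(axis=0, keepdims=True)
    # Rescale each component independently so it can be used as a color channel.
    return (projected - lo) / np.maximum(hi - lo, 1e-12)


# Usage: 100 random 16-dimensional feature vectors stand in for predicted features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 16))
rgb = features_to_rgb(feats)
```

Points with similar features receive similar colors, which is what makes object categories visually separable in this kind of plot. The sketch assumes at least three feature dimensions and at least three points.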