GASP: Unifying Geometric and Semantic Self-Supervised Pre-training for Autonomous Driving
William Ljungbergh, Adam Lilja, Adam Tonderski, Arvid Laveno Ling, Carl Lindström, Willem Verbeke, Junsheng Fu, Christoffer Petersson, Lars Hammarstrand, Michael Felsberg
TL;DR
GASP tackles scalable autonomous driving pre-training by learning a unified 4D occupancy representation that integrates geometric occupancy, ego-path traversal, and semantic features from a vision foundation model. By training on future lidar data, camera-derived features, and ego poses, it yields a rich, generalizable representation for downstream tasks such as semantic BEV forecasting, online mapping, and ego-trajectory prediction. Empirical results show consistent gains over a strong UnO baseline, especially under limited labeled data, and scaling experiments demonstrate data-efficient improvements across datasets. The work also introduces practical enhancements like rotation augmentation and missing-lidar-ray supervision, and provides open-source code to promote community adoption and further research.
Abstract
Self-supervised pre-training based on next-token prediction has enabled large language models to capture the underlying structure of text, and has led to unprecedented performance on a large array of tasks when applied at scale. Similarly, autonomous driving generates vast amounts of spatiotemporal data, pointing to the possibility of harnessing scale to learn the underlying geometric and semantic structure of the environment and its evolution over time. In this direction, we propose a geometric and semantic self-supervised pre-training method, GASP, that learns a unified representation by predicting, at any queried future point in spacetime, (1) general occupancy, capturing the evolving structure of the 3D scene; (2) ego occupancy, modeling the ego vehicle path through the environment; and (3) distilled high-level features from a vision foundation model. By modeling geometric and semantic 4D occupancy fields instead of raw sensor measurements, the model learns a structured, generalizable representation of the environment and its evolution through time. We validate GASP on multiple autonomous driving benchmarks, demonstrating significant improvements in semantic occupancy forecasting, online mapping, and ego trajectory prediction. Our results demonstrate that continuous 4D geometric and semantic occupancy prediction provides a scalable and effective pre-training paradigm for autonomous driving. For code and additional visualizations, see https://research.zenseact.com/publications/gasp/.
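To make the three prediction targets concrete, the following is a minimal sketch of a field that can be queried at arbitrary spacetime points (x, y, z, t) and returns general occupancy, ego occupancy, and a feature vector. It is not the authors' implementation: the function names (`init_heads`, `query_field`, `pretraining_loss`), the layer sizes, the single global latent vector standing in for the encoder output, and the choice of binary cross-entropy plus an L1 feature loss are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_heads(latent_dim=32, feat_dim=16, hidden=64):
    """Hypothetical three-head query decoder; all sizes are illustrative."""
    in_dim = latent_dim + 4  # scene latent concatenated with an (x, y, z, t) query
    return {
        "w1": rng.normal(0.0, 0.1, (in_dim, hidden)),
        "b1": np.zeros(hidden),
        "w_occ": rng.normal(0.0, 0.1, (hidden, 1)),      # general occupancy head
        "w_ego": rng.normal(0.0, 0.1, (hidden, 1)),      # ego occupancy head
        "w_feat": rng.normal(0.0, 0.1, (hidden, feat_dim)),  # distilled feature head
    }


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def query_field(params, latent, queries):
    """Evaluate the field at N spacetime queries of shape (N, 4)."""
    n = queries.shape[0]
    # Assumption: one global latent vector is shared by every query.
    x = np.concatenate([np.tile(latent, (n, 1)), queries], axis=1)
    h = np.maximum(x @ params["w1"] + params["b1"], 0.0)  # ReLU hidden layer
    occ = sigmoid(h @ params["w_occ"])[:, 0]   # probability the point is occupied
    ego = sigmoid(h @ params["w_ego"])[:, 0]   # probability the ego vehicle covers it
    feat = h @ params["w_feat"]                # predicted foundation-model feature
    return occ, ego, feat


def bce(p, y, eps=1e-7):
    """Mean binary cross-entropy between probabilities p and 0/1 targets y."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))


def pretraining_loss(params, latent, queries, occ_target, ego_target, feat_target):
    """Sum of the three self-supervised terms (equal weights are an assumption)."""
    occ, ego, feat = query_field(params, latent, queries)
    feature_l1 = float(np.mean(np.abs(feat - feat_target)))
    return bce(occ, occ_target) + bce(ego, ego_target) + feature_l1
```

In this sketch the occupancy targets would come from future lidar, the ego targets from future ego poses, and the feature targets from a vision foundation model, as the abstract describes; how those targets are sampled and weighted is left open here.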
