Self-Supervised Representation Learning with Joint Embedding Predictive Architecture for Automotive LiDAR Object Detection

Haoran Zhu, Zhenyuan Dong, Kristi Topollai, Beiyao Sha, Anna Choromanska

TL;DR

This work tackles the labeling bottleneck in autonomous driving by introducing AD-L-JEPA, a joint embedding predictive architecture that learns LiDAR representations directly in BEV space. By combining modified BEV-guided masking, learnable empty/mask tokens, a lightweight predictor, and VICReg-style variance regularization with a moving-average target encoder, the method avoids both generative reconstruction and contrastive pairs. The approach yields consistent improvements in LiDAR 3D object detection on KITTI3D, Waymo, and ONCE while reducing pre-training GPU hours by 1.9x-2.7x and GPU memory by 2.8x-4x relative to Occupancy-MAE. This JEPA-based SSL framework offers a more efficient and scalable pathway for self-supervised learning in autonomous driving, with open-source code planned.
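
The moving-average target encoder mentioned above can be illustrated with a minimal sketch. It assumes the usual exponential-moving-average rule, in which the target parameters slowly track the context encoder's parameters. The function name `ema_update`, the flat list of parameters, and the momentum value are hypothetical choices for illustration, not details taken from the paper.

```python
def ema_update(target, context, momentum=0.99):
    """Return updated target parameters as a moving average of context parameters.

    target, context: flat lists of floats of equal length (a stand-in for
    encoder weights). momentum is an assumed value, not the paper's setting.
    """
    if len(target) != len(context):
        raise ValueError("parameter lists must have the same length")
    # new_target = momentum * old_target + (1 - momentum) * context
    return [momentum * t + (1.0 - momentum) * c for t, c in zip(target, context)]


# Toy example: the target moves 10% of the way toward the context parameters.
target_params = [0.0, 1.0]
context_params = [1.0, 1.0]
target_params = ema_update(target_params, context_params, momentum=0.9)
```

In a setup like this the target encoder receives no gradients; only the context encoder and predictor are trained, and the target follows through the averaging step.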

Abstract

Recently, self-supervised representation learning relying on vast amounts of unlabeled data has been explored as a pre-training method for autonomous driving. However, directly applying popular contrastive or generative methods to this problem is insufficient and may even lead to negative transfer. In this paper, we present AD-L-JEPA, a novel self-supervised pre-training framework with a joint embedding predictive architecture (JEPA) for automotive LiDAR object detection. Unlike existing methods, AD-L-JEPA is neither generative nor contrastive. Instead of explicitly generating masked regions, our method predicts Bird's-Eye-View embeddings to capture the diverse nature of driving scenes. Furthermore, our approach eliminates the need to manually form contrastive pairs by employing explicit variance regularization to avoid representation collapse. Experimental results demonstrate consistent improvements on the LiDAR 3D object detection downstream task across the KITTI3D, Waymo, and ONCE datasets, while reducing GPU hours by 1.9x-2.7x and GPU memory by 2.8x-4x compared with the state-of-the-art method Occupancy-MAE. Notably, on the largest ONCE dataset, pre-training on 100K frames yields a 1.61 mAP gain, better than all other methods pre-trained on either 100K or 500K frames, and pre-training on 500K frames yields a 2.98 mAP gain, better than all other methods pre-trained on either 500K or 1M frames. AD-L-JEPA constitutes the first JEPA-based pre-training method for autonomous driving. It delivers higher-quality, faster, and more GPU-memory-efficient self-supervised representation learning. The source code of AD-L-JEPA is ready to be released.
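
To make the role of explicit variance regularization more concrete, the sketch below shows a VICReg-style hinge on the per-dimension standard deviation of a batch of embeddings. It is an illustration under stated assumptions, not the paper's loss: the function name `variance_regularization`, the target standard deviation `gamma`, the stabilizer `eps`, and the use of the unbiased variance are all assumed, and the sketch ignores the paper's restriction of the term to non-empty BEV regions.

```python
import math


def variance_regularization(embeddings, gamma=1.0, eps=1e-4):
    """Mean hinge penalty on the per-dimension standard deviation.

    embeddings: list of equal-length float vectors (one per BEV cell or sample).
    gamma and eps are assumed hyperparameters chosen for illustration.
    """
    n = len(embeddings)
    if n < 2:
        raise ValueError("need at least two embeddings to estimate variance")
    dim = len(embeddings[0])
    loss = 0.0
    for d in range(dim):
        column = [row[d] for row in embeddings]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / (n - 1)  # unbiased estimate
        std = math.sqrt(var + eps)
        loss += max(0.0, gamma - std)  # penalize only dimensions with std below gamma
    return loss / dim


# Collapsed embeddings (all identical) are penalized; well-spread ones are not.
collapsed = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
spread = [[-2.0, 2.0], [0.0, 0.0], [2.0, -2.0]]
loss_collapsed = variance_regularization(collapsed)
loss_spread = variance_regularization(spread)
```

The penalty is close to `gamma` when every embedding is the same vector and drops to zero once each dimension varies enough, which is how a term of this kind discourages representation collapse without requiring contrastive pairs.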

Paper Structure

This paper contains 33 sections, 4 equations, 13 figures, and 17 tables.

Figures (13)

  • Figure 1: AD-L-JEPA predicts directly in embedding space instead of explicitly reconstructing masked point clouds. In driving scenes, a masked region can correspond to multiple plausible point clouds that share the same semantics (e.g., "car rear") and can be encoded into the same embedding. The approach significantly boosts downstream performance while reducing pre-training time and GPU memory usage.
  • Figure 2: Overview of the AD-L-JEPA architecture: We introduce modified BEV-guided masking to mask the input point cloud in both empty and non-empty regions. The network predicts BEV embeddings at masked regions, leveraging variance regularization at non-empty regions following the output of the context encoder and the lightweight spatial predictor. It also employs a moving average update of the target encoder to learn diverse, high-level semantic representations.
  • Figure 3: Comparison of the original BEV-guided masking (Lin et al., 2024) with our modified version, which creates masks in both empty and non-empty regions in non-overlapping BEV grids.
  • Figure 4: Masked region occupancy estimation, evaluated by computing the cosine similarity between the BEV embeddings obtained by AD-L-JEPA and the learnable empty token. Unmasked regions are ignored and shown in white.
  • Figure 5: Sorted normalized singular values and the corresponding cumulative explained variance, obtained by singular value decomposition of pre-trained BEV embeddings. Embeddings are obtained either with AD-L-JEPA or Occupancy-MAE.
  • ...and 8 more figures
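
The comparison described in the Figure 4 caption can be sketched as follows: for every masked BEV cell, compute the cosine similarity between its embedding and the learnable empty token, and skip unmasked cells. This is a minimal sketch, not the authors' evaluation code. The function names, the nested-list grid layout, and the toy values are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def empty_token_similarity(bev_embeddings, mask, empty_token):
    """Return an H x W map of similarities to the empty token.

    bev_embeddings: H x W grid (nested lists) of embedding vectors.
    mask: H x W grid of booleans, True where the cell was masked.
    Unmasked cells are ignored and reported as None.
    """
    similarity_map = []
    for emb_row, mask_row in zip(bev_embeddings, mask):
        similarity_map.append([
            cosine_similarity(emb, empty_token) if masked else None
            for emb, masked in zip(emb_row, mask_row)
        ])
    return similarity_map


# Toy 2 x 2 BEV grid with 2-dimensional embeddings (values are made up).
empty_token = [1.0, 0.0]
bev = [[[2.0, 0.0], [0.0, 3.0]],
       [[1.0, 1.0], [-1.0, 0.0]]]
mask = [[True, True],
        [False, True]]
sim_map = empty_token_similarity(bev, mask, empty_token)
```

Under this reading, a masked cell whose predicted embedding is highly similar to the empty token would be interpreted as unoccupied, while a low or negative similarity suggests the cell contains points.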