TREND: Unsupervised 3D Representation Learning via Temporal Forecasting for LiDAR Perception

Runjian Chen, Hyoungseob Park, Bo Zhang, Wenqi Shao, Ping Luo, Alex Wong

TL;DR

TREND addresses the high labeling cost of LiDAR by learning unsupervised 3D representations from temporal sequences. It introduces a Recurrent Embedding to propagate embeddings across time conditioned on ego-vehicle actions and a Temporal Neural Field with differentiable rendering to forecast future LiDAR frames. Across NuScenes, Once, and Waymo, TREND achieves state-of-the-art improvements over previous unsupervised pre-training methods and demonstrates enhanced sample efficiency in few-shot and transfer settings. This temporal forecasting approach yields more semantically meaningful 3D representations, improving downstream 3D object detection and segmentation tasks.

Abstract

Labeling LiDAR point clouds is notoriously time- and energy-consuming, which has spurred recent unsupervised 3D representation learning methods that alleviate the labeling burden in LiDAR perception via pre-trained weights. Almost all existing work focuses on a single LiDAR point cloud frame and neglects the temporal LiDAR sequence, which naturally accounts for object motion (and object semantics). Instead, we propose TREND, namely Temporal REndering with Neural fielD, to learn 3D representations by forecasting future observations in an unsupervised manner. Unlike existing work that follows conventional contrastive learning or masked autoencoding paradigms, TREND integrates forecasting into 3D pre-training through a Recurrent Embedding scheme, which generates 3D embeddings across time, and a Temporal Neural Field, which represents the 3D scene and through which we compute the loss using differentiable rendering. To the best of our knowledge, TREND is the first work on temporal forecasting for unsupervised 3D representation learning. We evaluate TREND on downstream 3D object detection tasks on popular datasets, including NuScenes, Once, and Waymo. Experimental results show that TREND brings up to 90% more improvement than previous SOTA unsupervised 3D pre-training methods and generally improves different downstream models across datasets, demonstrating that temporal forecasting indeed benefits LiDAR perception. Code and models will be released.

Paper Structure

This paper contains 16 sections, 13 equations, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Different schemes for unsupervised 3D representation learning. (a) Masked autoencoding first applies random masking to the current LiDAR point cloud and then pre-trains 3D backbones with a reconstruction objective. (b) Contrastive-based methods build different views of the current point cloud and pre-train the networks by pulling together positive pairs and pushing apart negative pairs. (c) Our proposed TREND explores object motion and semantic information in the LiDAR sequence and introduces temporal forecasting for unsupervised 3D pre-training.
  • Figure 2: The pipeline of TREND. "S.E." means sinusoidal encoding. To pre-train the encoder $f^{\text{enc}}$ via temporal forecasting in an unsupervised manner, TREND first generates 3D embeddings at different timestamps with a recurrent embedding scheme, as shown in part (a). Action embeddings are computed with sinusoidal encoding and projected by a multi-layer perceptron. The action embeddings are then repeated and concatenated with the embeddings from the previous timestamp, followed by a shared shallow 3D convolution $f^{\text{3D}}$ to generate 3D embeddings for timestamps $t_1$, $t_2$, ... As described in part (b), a Temporal Neural Field then represents the 3D scene at different timestamps. We query features of the sampled points along LiDAR rays and concatenate them with sinusoidal embeddings of the timestamps as well as the positions of the sampled points, then feed the result into a signed distance function $f^{\text{SDF}}$ to predict signed distance values. Next, we conduct differentiable rendering to aggregate the sampled points along each ray and predict the range in the direction of the ray, thereby reconstructing and forecasting the LiDAR point clouds at different timestamps. Finally, we compute the pre-training loss between the predicted LiDAR point clouds and the actual LiDAR sequence.
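
The rendering step in Figure 2(b) can be illustrated with a small sketch. The captions do not give the exact formulas, so everything below is an assumption made for illustration: the frequency schedule in `sinusoidal_encoding`, the NeuS-style conversion from signed distances to per-sample weights in `render_range`, the `sharpness` parameter, and the L1 loss on range. All function names are hypothetical, and the signed distances come from a toy planar surface rather than a learned $f^{\text{SDF}}$. The code runs only the forward computation in plain Python; in a real pre-training setup the same arithmetic would be written in an autodiff framework so that gradients reach the encoder.

```python
import math


def sinusoidal_encoding(value, num_freqs=4):
    """Encode a scalar as [sin(2^k v), cos(2^k v)] for k = 0..num_freqs-1.

    The power-of-two frequency schedule is an assumption, not the paper's.
    """
    features = []
    for k in range(num_freqs):
        features.append(math.sin((2.0 ** k) * value))
        features.append(math.cos((2.0 ** k) * value))
    return features


def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def render_range(sample_ranges, sdf_values, sharpness=20.0):
    """Aggregate samples along one ray into a single predicted range.

    Assumed NeuS-style weighting: the opacity of each interval is the
    relative drop of sigmoid(sharpness * sdf) across that interval.
    """
    cdf = [_sigmoid(sharpness * s) for s in sdf_values]
    transmittance = 1.0  # fraction of the ray not yet absorbed
    weighted_sum = 0.0
    weight_total = 0.0
    for i in range(len(sample_ranges) - 1):
        alpha = max((cdf[i] - cdf[i + 1]) / (cdf[i] + 1e-8), 0.0)
        weight = transmittance * alpha
        midpoint = 0.5 * (sample_ranges[i] + sample_ranges[i + 1])
        weighted_sum += weight * midpoint
        weight_total += weight
        transmittance *= 1.0 - alpha
    # Normalize so the result is a weighted mean of interval midpoints.
    return weighted_sum / max(weight_total, 1e-8)


# Toy ray: a wall 10 m away, so the signed distance at range r is 10 - r.
sample_ranges = [0.1 * i for i in range(201)]
toy_sdf = [10.0 - r for r in sample_ranges]

predicted = render_range(sample_ranges, toy_sdf)
observed = 10.0  # range a LiDAR return would report for this toy wall
l1_loss = abs(predicted - observed)  # assumed loss form
```

On this toy ray the weights concentrate around the zero crossing of the signed distance, so the rendered range lands close to the wall and the loss is small. A larger `sharpness` narrows the weight distribution around the surface.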