T-MAE: Temporal Masked Autoencoders for Point Cloud Representation Learning

Weijie Wei, Fatemeh Karimi Nejadasl, Theo Gevers, Martin R. Oswald

TL;DR

This work tackles the data bottleneck in LiDAR point-cloud understanding by leveraging temporal information in sequences. It introduces Temporal Masked Autoencoders (T-MAE) with a SiamWCA backbone that fuses a fully observed past frame with a masked current frame through windowed cross-attention to learn temporal dependencies via a reconstruction objective with a masking ratio of $0.75$. The approach yields state‑of‑the‑art results on Waymo and ONCE among self‑supervised methods, with significantly fewer finetuning iterations and substantial gains for pedestrians, and shows strong transferability across datasets. Overall, T-MAE reduces the annotation burden for autonomous-driving perception by providing a scalable, temporally aware SSL framework for sparse LiDAR data.
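As a rough illustration of the masking step mentioned above, the sketch below hides 75% of the voxels of the current frame and keeps the rest visible. It assumes uniform random sampling over voxel indices, which the summary does not specify; the helper name `mask_voxels` and the `seed` parameter are hypothetical and not taken from the T-MAE code.

```python
import random


def mask_voxels(voxel_ids, mask_ratio=0.75, seed=0):
    """Split voxel ids into (visible, masked) sets.

    Assumption: voxels are masked uniformly at random; the actual
    sampling scheme used by T-MAE may differ.
    """
    rng = random.Random(seed)
    ids = list(voxel_ids)
    rng.shuffle(ids)
    num_masked = int(round(len(ids) * mask_ratio))
    masked = sorted(ids[:num_masked])
    visible = sorted(ids[num_masked:])
    return visible, masked


# Current frame with 100 non-empty voxels: 25 stay visible, 75 are hidden.
visible, masked = mask_voxels(range(100))
```

In this setup only the current frame would be masked; the past frame is passed to the encoder in full, so the reconstruction target can be inferred partly from historical context.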

Abstract

The scarcity of annotated data in LiDAR point cloud understanding hinders effective representation learning. Consequently, scholars have been actively investigating efficacious self-supervised pre-training paradigms. Nevertheless, temporal information, which is inherent in the LiDAR point cloud sequence, is consistently disregarded. To better utilize this property, we propose an effective pre-training strategy, namely Temporal Masked Auto-Encoders (T-MAE), which takes as input temporally adjacent frames and learns temporal dependency. A SiamWCA backbone, containing a Siamese encoder and a windowed cross-attention (WCA) module, is established for the two-frame input. Considering that the movement of an ego-vehicle alters the view of the same instance, temporal modeling also serves as a robust and natural data augmentation, enhancing the comprehension of target objects. SiamWCA is a powerful architecture but heavily relies on annotated data. Our T-MAE pre-training strategy alleviates its demand for annotated data. Comprehensive experiments demonstrate that T-MAE achieves the best performance on both Waymo and ONCE datasets among competitive self-supervised approaches. Code will be released at https://github.com/codename1995/T-MAE

Paper Structure (22 sections, 1 equation, 11 figures, 16 tables)

Figures (11)

  • Figure 1: T-MAE performance on Waymo (Sun et al., 2020). Left: Each point triplet shows the performance differences to three models finetuned with the same data. The triplets show finetuned models with 8K, 16K, 32K labeled frames (left to right). Our T-MAE pre-training outperforms both random initialization and the SOTA SSL method MV-JAR (Xu et al., 2023) with significantly fewer iterations. Right: T-MAE yields higher mAPH for pedestrians when finetuned with half the labeled data than MV-JAR.
  • Figure 2: Comparison between single- and four-frame concatenation. While simple frame concatenation generally improves point density and detection rates, it can introduce spurious points in non-static scene parts that may degrade the detection performance. Since we combine consecutive frames via learned cross-attention, our approach is less affected by this problem. The blue bounding boxes indicate the ground truth for the current frame.
  • Figure 3: Overview of our architecture and the proposed T-MAE pre-training. Two frames are sampled from a sequence of point clouds and are voxelized. During pre-training, the current frame $\mathcal{P}^{t_2}$ undergoes an additional masking process. Note that the dashed boxes indicate operations for the pre-training phase only. Next, voxel-wise tokens are computed by a Siamese encoder. The two-way gray arrow indicates weight sharing. The WCA module takes as input the full tokens of the previous frame and the partial observation of the current frame and outputs enhanced tokens. The dense feature recovery places sparse tokens back into a dense feature map and convolves the map to fill empty locations. Subsequently, the feature map is fed either to a reconstruction head that recovers masked points or to a detection head that predicts bounding boxes.
  • Figure 4: Windowed sparse cross-attention (WCA). Given the input tokens from both $\mathcal{P}^{t_1}$ and $\mathcal{P}^{t_2}$, a joint token grouping is performed to obtain a window partition. A sparse regional cross-attention (SRCA) is performed independently in each window to integrate the historical information into the tokens of the current frame. In other words, tokens from the two frames that share the same color attend to each other. For simplicity, the information flow is only depicted for the green tokens. After the second joint token grouping, the cross-attention is performed once more with the shifted window partition. The red dot indicates the ego-vehicle driving towards the right. The box with diagonal stripes represents an object, e.g., a vehicle, moving towards the left. Best viewed in color and high-resolution.
  • Figure 5: Qualitative results. We depict ground truth and predictions as boxes colored in red and green for two exemplary scenes from the Waymo dataset (Sun et al., 2020).
  • ...and 6 more figures
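
The windowed cross-attention described in the Figure 4 caption can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the authors' implementation: it assumes 2D integer voxel coordinates, omits learned query/key/value projections, uses a residual addition, and shifts the second window partition by half a window. All function names and parameters here are hypothetical.

```python
import math
from collections import defaultdict


def window_key(coord, window, shift=0):
    """Map a 2D voxel coordinate to the index of the window containing it."""
    x, y = coord
    return ((x + shift) // window, (y + shift) // window)


def cross_attend(query, keys, values):
    """Scaled dot-product attention of one query over a list of keys/values."""
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def windowed_cross_attention(cur, prev, window=4, shift=0):
    """cur, prev: lists of (coord, feature). Returns updated current tokens.

    Both frames are grouped with the same window partition; each current
    token attends only to past tokens that fall into its own window.
    """
    groups = defaultdict(list)
    for coord, feat in prev:
        groups[window_key(coord, window, shift)].append(feat)
    out = []
    for coord, feat in cur:
        past = groups.get(window_key(coord, window, shift))
        if not past:
            # Assumption: tokens without past context pass through unchanged.
            out.append((coord, list(feat)))
            continue
        attended = cross_attend(feat, past, past)
        out.append((coord, [f + a for f, a in zip(feat, attended)]))
    return out


prev_tokens = [((0, 0), [1.0, 0.0]), ((1, 2), [0.0, 1.0]), ((9, 9), [5.0, 5.0])]
cur_tokens = [((2, 1), [1.0, 0.0]), ((6, 6), [0.5, 0.5])]

# First pass with the regular partition, second pass with the shifted one.
first = windowed_cross_attention(cur_tokens, prev_tokens, window=4, shift=0)
second = windowed_cross_attention(first, prev_tokens, window=4, shift=2)
```

In this toy example the current token at (6, 6) has no past token in its window during the first pass, but the shifted partition in the second pass places it in the same window as the past token at (9, 9). This mirrors the motivation for repeating the attention with a shifted partition: objects that move between frames can straddle a window boundary.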