Enhanced Spatiotemporal Consistency for Image-to-LiDAR Data Pretraining

Xiang Xu, Lingdong Kong, Hui Shuai, Wenwei Zhang, Liang Pan, Kai Chen, Ziwei Liu, Qingshan Liu

TL;DR

SuperFlow++ tackles the data-hungry nature of LiDAR perception by embedding spatiotemporal cues into image-to-LiDAR pretraining and downstream tasks. It introduces four components—view consistency alignment, dense-to-sparse consistency regularization, flow-based contrastive learning, and temporal voting—to capture temporal dynamics across consecutive LiDAR–camera frames. Extensive experiments across 11 heterogeneous datasets show superior performance and reveal emergent scaling properties when enlarging 2D and 3D backbones, as well as robustness in semi-supervised settings. The work advances data-efficient, temporally aware 3D perception for autonomous driving and provides a roadmap for scalable multi-modal 3D foundation models.

Abstract

LiDAR representation learning has emerged as a promising approach to reducing reliance on costly and labor-intensive human annotations. While existing methods primarily focus on spatial alignment between LiDAR and camera sensors, they often overlook the temporal dynamics critical for capturing motion and scene continuity in driving scenarios. To address this limitation, we propose SuperFlow++, a novel framework that integrates spatiotemporal cues in both pretraining and downstream tasks using consecutive LiDAR-camera pairs. SuperFlow++ introduces four key components: (1) a view consistency alignment module to unify semantic information across camera views, (2) a dense-to-sparse consistency regularization mechanism to enhance feature robustness across varying point cloud densities, (3) a flow-based contrastive learning approach that models temporal relationships for improved scene understanding, and (4) a temporal voting strategy that propagates semantic information across LiDAR scans to improve prediction consistency. Extensive evaluations on 11 heterogeneous LiDAR datasets demonstrate that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions. Furthermore, by scaling both 2D and 3D backbones during pretraining, we uncover emergent properties that provide deeper insights into developing scalable 3D foundation models. With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving. The code is publicly available at https://github.com/Xiangxu-0103/SuperFlow
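The abstract describes temporal voting only at a high level, so the following is a minimal sketch of one way such voting could work, not the authors' implementation. It assumes that all scans have already been transformed into a shared world frame using ego poses, that points are matched across scans by falling into the same voxel, and that the vote is a plain majority. The function name `temporal_vote` and the `voxel_size` parameter are hypothetical.

```python
import math
from collections import Counter, defaultdict


def temporal_vote(scans, voxel_size=0.5):
    """Majority-vote per-point labels across several LiDAR scans.

    scans: list of (points, labels) pairs. `points` is a list of (x, y, z)
    tuples that are ASSUMED to be in one shared world frame already;
    `labels` is a list of predicted class ids, one per point.
    Returns one list of voted labels per input scan.
    """

    def voxel_key(point):
        # Points from different scans that land in the same voxel are
        # treated as observations of the same piece of the scene.
        return tuple(math.floor(c / voxel_size) for c in point)

    # Collect the votes cast by every scan for every occupied voxel.
    votes = defaultdict(Counter)
    for points, labels in scans:
        for point, label in zip(points, labels):
            votes[voxel_key(point)][label] += 1

    # Re-label each point with the most frequent class in its voxel.
    voted = []
    for points, _ in scans:
        voted.append(
            [votes[voxel_key(p)].most_common(1)[0][0] for p in points]
        )
    return voted


if __name__ == "__main__":
    scans = [
        ([(0.1, 0.1, 0.0), (5.0, 5.0, 0.0)], [1, 2]),
        ([(0.2, 0.2, 0.1), (5.1, 5.1, 0.1)], [1, 2]),
        ([(0.3, 0.1, 0.2), (5.2, 5.2, 0.2)], [3, 2]),  # one outlier label
    ]
    print(temporal_vote(scans))
```

In this toy input the third scan disagrees on the first point, and the vote overrides it with the label the other two scans agree on. A real pipeline would likely weight votes by confidence and handle moving objects separately; both are left out here.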

Paper Structure

This paper contains 20 sections, 7 equations, 6 figures, 10 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparisons of different superpixels. (a) Class-agnostic superpixels generated by the unsupervised SLIC [achanta2012slic] algorithm. (b) Class-agnostic semantic superpixels generated by vision foundation models (VFMs) [zou2023seem, zhang2023openSeeD, zou2023xdecoder]. (c) View-consistent semantic superpixels generated by our view consistency alignment module.
  • Figure 2: Dense-to-sparse (D2S) Consistency Regularization Module. Dense point clouds are generated by aggregating multi-sweep LiDAR scans captured over a defined time window. The D2S regularization enforces consistency between the dense and sparse point clouds, improving the model's ability to learn robust and detailed representations.
  • Figure 3: Flow-based contrastive learning (FCL) pipeline. The FCL pipeline processes multiple LiDAR-camera pairs captured over consecutive scans. It leverages temporally aligned semantic superpixels to define three contrastive learning objectives: (1) Spatial Contrastive Learning, which enforces feature consistency between LiDAR and camera modalities within the same frame, (2) Intra-Sensor Temporal Contrastive Learning, which aligns features across consecutive LiDAR scans, ensuring temporal coherence in dynamic scenes, and (3) Cross-Sensor Temporal Contrastive Learning, which aligns LiDAR features with temporally adjacent camera frames to enhance cross-modal consistency. This approach improves multi-modal representation learning by integrating spatial and temporal consistency.
  • Figure 4: Qualitative assessments of state-of-the-art pretraining methods pretrained on nuScenes [caesar2020nuScenes] and fine-tuned on the Waymo Open [sun2020waymoOpen] dataset with $1\%$ annotations. The error maps show correct and incorrect predictions in gray and red, respectively. Best viewed in color and zoomed in for details.
  • Figure 5: Qualitative results of object detection trained with 5% labeled data. The first row shows the model trained from random initialization, while the second row shows results from our proposed framework. Ground-truth and predicted boxes are highlighted in blue and red, respectively. Best viewed in color and zoomed in for additional details.
  • ...and 1 more figure
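The Figure 3 caption names three contrastive objectives but does not give their form. As a hedged illustration, the sketch below shows an InfoNCE-style loss, a common choice in image-to-LiDAR pretraining, applied to paired superpoint and superpixel features. It assumes that row i of each input is a matched pair, with all other rows acting as negatives. The function names and the temperature value of 0.07 are assumptions, not details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(point_feats, pixel_feats, temperature=0.07):
    """InfoNCE-style loss between paired superpoint / superpixel features.

    point_feats[i] and pixel_feats[i] are ASSUMED to describe the same
    superpixel region; every other row is used as a negative.
    Both inputs are non-empty lists of equal-length float lists.
    """
    queries = [l2_normalize(v) for v in point_feats]
    keys = [l2_normalize(v) for v in pixel_feats]

    total = 0.0
    for i, q in enumerate(queries):
        # Cosine similarity of this query to every key, scaled by temperature.
        logits = [
            sum(a * b for a, b in zip(q, k)) / temperature for k in keys
        ]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(
            sum(math.exp(value - peak) for value in logits)
        )
        # Cross-entropy with the matched key (index i) as the target.
        total += log_denominator - logits[i]
    return total / len(queries)


if __name__ == "__main__":
    lidar = [[1.0, 0.0], [0.0, 1.0]]
    image_matched = [[1.0, 0.0], [0.0, 1.0]]
    image_swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(info_nce(lidar, image_matched))  # small: pairs agree
    print(info_nce(lidar, image_swapped))  # large: pairs disagree
```

Under the same assumption, the two temporal objectives in the caption would differ only in which features are paired: LiDAR features from consecutive scans for the intra-sensor case, and LiDAR features with temporally adjacent camera features for the cross-sensor case.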