Cohere3D: Exploiting Temporal Coherence for Unsupervised Representation Learning of Vision-based Autonomous Driving

Yichen Xie; Hongge Chen; Gregory P. Meyer; Yong Jae Lee; Eric M. Wolff; Masayoshi Tomizuka; Wei Zhan; Yuning Chai; Xin Huang

TL;DR

Cohere3D targets the recovery of accurate 3D object states from camera images in autonomous driving by learning long-term, instance-level temporal coherence. It constructs LiDAR-guided long-term correspondences across frames and uses them to supervise BEV-based instance representations through a depth-aware contrastive learning objective in a Siamese online/target setup with memory. The approach improves data efficiency and downstream performance across 3D detection, HD map construction, motion prediction, and end-to-end planning, while enabling robust temporal fusion under changing viewpoints. Because the temporal associations come from unlabeled LiDAR data, Cohere3D serves as a practical pretraining paradigm for vision-based perception and planning in real-world driving scenarios.

Abstract

Due to the lack of depth cues in images, multi-frame inputs are important for the success of vision-based perception, prediction, and planning in autonomous driving. Observations from different angles enable the recovery of 3D object states from 2D image inputs, provided that the same instance can be identified in different input frames. However, the dynamic nature of autonomous driving scenes leads to significant changes in the appearance and shape of each instance as captured by the camera at different time steps. To this end, we propose a novel contrastive learning algorithm, Cohere3D, to learn coherent instance representations over a long-term input sequence that are robust to changes in distance and perspective. The learned representation aids instance-level correspondence across multiple input frames in downstream tasks. In the pretraining stage, raw point clouds from LiDAR sensors are used to construct the long-term temporal correspondence for each instance, which guides the extraction of instance-level representations from the vision-based bird's-eye-view (BEV) feature map. Cohere3D encourages a consistent representation for the same instance across different frames while distinguishing between representations of different instances. We evaluate our algorithm by finetuning the pretrained model on various downstream perception, prediction, and planning tasks. Results show a notable improvement in both data efficiency and task performance.
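
The objective described in the abstract (pull together representations of the same instance at different frames, push apart representations of different instances) can be illustrated with a generic InfoNCE-style loss. This is only a minimal sketch under stated assumptions: the function names, the cosine similarity, the temperature value, and the toy 2D features are all hypothetical, and the sketch does not include the depth-aware weighting or the online/target networks mentioned in the TL;DR.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, candidates, temperature=0.1):
    """Generic InfoNCE loss (an assumption, not the paper's exact objective).

    anchors[i] and candidates[i] are taken to be features of the same
    instance observed at two different frames; every other candidate
    acts as a negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy features for two instances seen at frame t and at a later frame.
frame_t = [[1.0, 0.0], [0.0, 1.0]]
frame_later = [[0.9, 0.1], [0.1, 0.9]]

aligned = info_nce(frame_t, frame_later)        # correct correspondence
swapped = info_nce(frame_t, frame_later[::-1])  # wrong correspondence
```

With coherent features, the loss under the correct correspondence is much lower than under the swapped one, which is the behavior the pretraining objective is meant to reward.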

Paper Structure (39 sections, 6 equations, 7 figures, 6 tables)

Introduction
Related Works
Camera-based Autonomous Driving
Pretraining for Perception and Prediction
Representation Learning from Temporal Cues
Methodology
Formulation
LiDAR-guided Temporal Correspondence
Instance Identification
Long-term Matching
Temporal Instance Feature
Depth-aware Contrastive Learning
Architecture
Depth-aware Representation Learning
View Transformation Mask
...and 24 more sections

Figures (7)

Figure 1: Temporal coherence of instance representations. Our algorithm encourages instance representations in the BEV space to be coherent across time and viewpoints.
Figure 2: Overview of Cohere3D. We construct long-term instance-level temporal correspondence via unlabeled LiDAR point clouds, and design a contrastive learning framework to encourage the long-term coherence of instance representations. After pretraining, the online network is transferred to downstream task-specific models.
Figure 3: Construction of long-term temporal correspondence. The first and last scans corresponding to each frame are correlated by clustering. The Hungarian algorithm connects every two neighboring frames to generate long-term correspondence.
Figure 4: Qualitative comparison of SOLOFusion model trained from scratch (left) vs with pretraining (right). Our pretraining is effective at improving recall rates, reducing false positives, and localizing objects in the 3D space.
Figure 5: Performance at different finetuning epochs. Our pretraining leads to faster convergence.
...and 2 more figures
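
The caption of Figure 3 describes connecting every two neighboring frames by optimal matching and chaining the results into long-term correspondence. A minimal sketch of that chaining idea follows. Several things here are assumptions rather than details from the paper: instances are reduced to hypothetical 2D cluster centroids, the matching cost is plain Euclidean distance, every frame is assumed to contain all instances from the first frame, and a brute-force search over permutations stands in for the Hungarian algorithm (it returns the same optimal assignment on tiny inputs but does not scale).

```python
import math
from itertools import permutations

def match_frames(prev_centers, next_centers):
    """Optimal one-to-one assignment from prev indices to next indices.

    Brute force over permutations, used here as a stand-in for the
    Hungarian algorithm. Assumes len(prev_centers) <= len(next_centers).
    """
    best_perm, best_cost = None, math.inf
    for perm in permutations(range(len(next_centers)), len(prev_centers)):
        cost = sum(
            math.dist(prev_centers[i], next_centers[j])
            for i, j in enumerate(perm)
        )
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return dict(enumerate(best_perm))

def chain_tracks(frames):
    """Chain neighboring-frame matches into one index track per instance.

    tracks[k][t] is the index, within frame t, of the instance that
    started as index k in the first frame.
    """
    tracks = [[i] for i in range(len(frames[0]))]
    for prev_centers, next_centers in zip(frames, frames[1:]):
        mapping = match_frames(prev_centers, next_centers)
        for track in tracks:
            track.append(mapping[track[-1]])
    return tracks

# Hypothetical centroids of two instances over three frames; the
# per-frame ordering of clusters is arbitrary, as it would be after
# clustering each frame independently.
frames = [
    [(0.0, 0.0), (10.0, 0.0)],
    [(10.5, 0.2), (0.4, 0.1)],
    [(1.0, 0.0), (11.0, 0.5)],
]
tracks = chain_tracks(frames)
```

Each track lists where one instance appears in every frame, which is the kind of long-term correspondence that can then supervise instance features across time.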
