Toward Unified Multimodal Representation Learning for Autonomous Driving

Ximeng Tao, Dimitar Filev, Gaurav Pandey

TL;DR

Contrastive Tensor Pre-training (CTP) is a framework that simultaneously aligns multiple modalities in a unified embedding space to enhance end-to-end autonomous driving. It introduces a tensor loss that enables joint contrastive learning across all modalities.

Abstract

Contrastive Language-Image Pre-training (CLIP) has shown impressive performance in aligning visual and textual representations. Recent studies have extended this paradigm to 3D vision to improve scene understanding for autonomous driving. A common strategy is to employ pairwise cosine similarity between modalities to guide the training of a 3D encoder. However, considering the similarity between individual modality pairs rather than all modalities jointly fails to ensure consistent and unified alignment across the entire multimodal space. In this paper, we propose a Contrastive Tensor Pre-training (CTP) framework that simultaneously aligns multiple modalities in a unified embedding space to enhance end-to-end autonomous driving. Compared with pairwise cosine similarity alignment, our method extends the 2D similarity matrix into a multimodal similarity tensor. Furthermore, we introduce a tensor loss to enable joint contrastive learning across all modalities. For experimental validation of our framework, we construct a text-image-point cloud triplet dataset derived from existing autonomous driving datasets. The results show that our proposed unified multimodal alignment framework achieves favorable performance for both scenarios: (i) aligning a 3D encoder with pretrained CLIP encoders, and (ii) pretraining all encoders from scratch.
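
To make the step from a 2D similarity matrix to a multimodal similarity tensor concrete, the following NumPy sketch contrasts the two. It is an illustration under stated assumptions, not the authors' implementation: the three-way score (the L2 norm of the sum of the three unit-normalized features), the per-plane cross-entropy, the temperature value, and all function names are hypothetical choices made here, since the abstract does not define them.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def pairwise_similarity(a, b):
    """CLIP-style 2D matrix: cosine similarity between every row of a and b."""
    return l2_normalize(a) @ l2_normalize(b).T  # shape (N, N)

def similarity_tensor(img, txt, pc):
    """3D tensor with one score per (image i, text j, point cloud k) triplet.

    ASSUMPTION: the score is the L2 norm of the sum of the three unit
    vectors, which reaches its maximum of 3 only when all three coincide.
    """
    i = l2_normalize(img)[:, None, None, :]
    t = l2_normalize(txt)[None, :, None, :]
    p = l2_normalize(pc)[None, None, :, :]
    return np.linalg.norm(i + t + p, axis=-1)  # shape (N, N, N)

def tensor_cross_entropy(sim, temperature=0.07):
    """Cross-entropy that treats the matched triplet (i, i, i) as the target.

    ASSUMPTION: for each anchor along each axis, the remaining N x N plane
    is flattened into N*N logits; the loss is averaged over the three axes.
    """
    n = sim.shape[0]
    idx = np.arange(n)
    target = idx * n + idx  # flat index of the (i, i) entry in plane i
    losses = []
    for axis in range(3):
        logits = np.moveaxis(sim, axis, 0).reshape(n, n * n) / temperature
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        losses.append(-log_prob[idx, target].mean())
    return float(np.mean(losses))
```

With a batch of N triplets, `pairwise_similarity` yields N² scores per modality pair, whereas `similarity_tensor` yields N³ scores that each depend on all three modalities at once, which is the property the joint loss relies on.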

Paper Structure (21 sections, 10 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Overview of different contrastive representation learning methods. (a) CLIP: aligns two modalities. (b) Aligns a third modality with two already aligned modalities. (c) Performs pairwise alignment between every modality pair. (d) CTP: aligns all modalities toward one point.
  • Figure 2: Overview of the multimodal language model–based end-to-end autonomous driving system. The multimodal encoder is pretrained to align all modalities within a unified embedding space, enabling the LLM to jointly understand cross-modal information and generate reasoning, scene descriptions, and future trajectory predictions. In this work, we primarily focus on training the "Multimodal Encoder" that can be used in an end-to-end driving system as shown here.
  • Figure 3: Overview of the CTP framework. Triplet dataset: LiDAR point clouds, cropped images, and annotations are extracted from autonomous driving datasets to form triplet samples. Each annotation is expanded into a detailed caption using a VLM. Similarity tensor: image, text, and point cloud features are arranged along the $x$, $y$, and $z$ axes to form a 3D similarity tensor. Each element corresponds to a unique combination of three features, and its value measures the similarity among them. During training, the similarity scores of the matched triplets (small purple cubes) are maximized using a cross-entropy loss.
  • Figure 4: Different strategies for flattening the plane loss: (a) nm: direct flattening without masking; (b) mask: masking duplicated entries. CTP adopts (b) as the standard flattening strategy.
  • Figure 5: Zero-shot classification. Each image–point cloud pair is compared with all text features, and the class is determined by the highest L2-norm similarity. When computing similarity between point cloud or image features and text features only, cosine similarity is employed.
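
The zero-shot procedure described in the Figure 5 caption can be sketched as follows. This is a minimal NumPy illustration rather than the paper's code: the caption names an "L2-norm similarity" without defining it, so the form used here (the L2 norm of the sum of the unit-normalized image, point cloud, and text features) is an assumption, and the function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def zero_shot_joint(img, pc, class_txt):
    """Classify each image-point cloud pair against all class text features.

    img, pc: (N, D) features of N paired samples.
    class_txt: (C, D) text features, one per class.
    ASSUMPTION: the "L2-norm similarity" is ||img + pc + txt||_2 on unit vectors.
    """
    pair = l2_normalize(img) + l2_normalize(pc)              # (N, D)
    combined = pair[:, None, :] + l2_normalize(class_txt)[None, :, :]
    scores = np.linalg.norm(combined, axis=-1)               # (N, C)
    return scores.argmax(axis=1)

def zero_shot_single(feat, class_txt):
    """Classify a single modality (image or point cloud) by cosine similarity."""
    scores = l2_normalize(feat) @ l2_normalize(class_txt).T  # (N, C)
    return scores.argmax(axis=1)
```

Both functions return one predicted class index per sample; the joint variant uses the image and point cloud together, while the single-modality variant falls back to the cosine similarity the caption describes for two-way comparisons.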