Le MuMo JEPA: Multi-Modal Self-Supervised Representation Learning with Learnable Fusion Tokens

Ciem Cornelissen, Sam Leroux, Pieter Simoens

Abstract

Self-supervised learning has emerged as a powerful paradigm for learning visual representations without manual annotations, yet most methods still operate on a single modality and therefore miss the complementary structure available from heterogeneous sensors. We present Le MuMo JEPA, a self-supervised framework that learns unified representations from RGB images and aligned companion modalities. In our driving experiments, the second modality is camera-aligned LiDAR depth; we also evaluate RGB-thermal training and transfer on the Teledyne FLIR ADAS benchmark. Our approach extends LeJEPA to the multimodal setting by learning fusion tokens that act as a latent bottleneck between modality-specific patch stems inside a shared transformer. Our default model employs a pruned fusion strategy: after an initial cross-modal attention layer, the modality-specific tokens are dropped, forcing cross-modal information through the shared fusion-token grid, an efficient latent bottleneck, before Sketched Isotropic Gaussian Regularization (SIGReg) is applied to the joint multimodal CLS embedding. On Waymo, Le MuMo JEPA gives the strongest performance-efficiency trade-off on downstream patch probes among the from-scratch multimodal baselines, improving CenterNet detection and dense depth while remaining competitive on segmentation. Under from-scratch training on nuScenes, Le MuMo JEPA remains the strongest model, and it gives the best FLIR results, especially after Waymo-initialized fine-tuning. Overall, it retains the best accuracy-efficiency balance in our study at substantially lower compute, memory, and estimated training time.
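To make the pruned fusion strategy concrete, the sketch below shows one way the token flow could look in PyTorch. It is an illustrative assumption rather than the authors' implementation: the module name PrunedFusionEncoder, the embedding width, fusion-token count, and layer counts are invented, positional embeddings and the prediction objective are omitted, and SIGReg is only indicated by a comment on the CLS embedding it would regularize.

import torch
import torch.nn as nn

class PrunedFusionEncoder(nn.Module):
    """Hypothetical sketch of learnable fusion tokens with pruned fusion."""
    def __init__(self, dim=256, n_fusion=64, depth=6, heads=8, patch=16):
        super().__init__()
        # Modality-specific patch stems: RGB and a 1-channel companion modality
        # (camera-aligned LiDAR depth in the driving setting, or thermal for FLIR).
        self.rgb_stem = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.aux_stem = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        # Learnable fusion tokens (the latent bottleneck) plus a CLS token for the
        # joint multimodal embedding that SIGReg would regularize.
        self.fusion = nn.Parameter(0.02 * torch.randn(1, n_fusion, dim))
        self.cls = nn.Parameter(0.02 * torch.randn(1, 1, dim))
        cross_layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim,
                                                 batch_first=True, norm_first=True)
        shared_layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim,
                                                  batch_first=True, norm_first=True)
        self.cross = nn.TransformerEncoder(cross_layer, num_layers=1)       # initial cross-modal attention
        self.shared = nn.TransformerEncoder(shared_layer, num_layers=depth) # shared trunk, fusion tokens only

    def forward(self, rgb, aux):
        b = rgb.size(0)
        rgb_tok = self.rgb_stem(rgb).flatten(2).transpose(1, 2)   # (B, N_rgb, dim)
        aux_tok = self.aux_stem(aux).flatten(2).transpose(1, 2)   # (B, N_aux, dim)
        tokens = torch.cat([self.cls.expand(b, -1, -1),
                            self.fusion.expand(b, -1, -1),
                            rgb_tok, aux_tok], dim=1)
        tokens = self.cross(tokens)                    # one cross-modal attention layer over all tokens
        tokens = tokens[:, :1 + self.fusion.size(1)]   # prune: drop modality-specific tokens
        tokens = self.shared(tokens)                   # shared transformer over CLS + fusion-token grid
        return tokens[:, 0], tokens[:, 1:]             # joint CLS embedding (SIGReg target), fusion grid

enc = PrunedFusionEncoder()
cls_emb, fusion_grid = enc(torch.randn(2, 3, 224, 224), torch.randn(2, 1, 224, 224))
print(cls_emb.shape, fusion_grid.shape)  # torch.Size([2, 256]) torch.Size([2, 64, 256])

After the pruning step only the CLS token and the fusion-token grid remain, so any cross-modal information that should survive into the shared trunk must already have been written into those tokens by the first attention layer; this is the bottleneck behaviour the abstract describes.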

Paper Structure

This paper contains 47 sections, 6 equations, 4 figures, and 6 tables.

Figures (4)

  • Figure 1: Overview of Le MuMo JEPA. The companion modality is represented as a spatially aligned signal and fused with RGB through learnable fusion tokens, which act as a latent bottleneck inside a shared transformer. The default training objective applies SIGReg to the joint multimodal CLS embedding.
  • Figure 2: Waymo dataset showcase. Example synchronized supervision used in our experiments: (a) RGB image, (b) camera-view segmentation, (c) the aligned companion-modality signal shown in depth form for the driving setting, and (d) projected 3D bounding boxes.
  • Figure 3: Waymo patch-embedding visualization. t-SNE projections of final-layer patch embeddings are shown with class-oriented structure and depth-oriented structure, illustrating how the learned patch space organizes both semantic grouping and geometric variation across methods. The left column uses Le MuMo JEPA fusion-token embeddings, and the right column uses LeJEPA patch embeddings. A planar fit to the plotted depth gradient gives an $R^2$ score of $0.463$ for Le MuMo JEPA versus $0.086$ for LeJEPA (see the planar-fit sketch after this list).
  • Figure 4: Waymo qualitative probe output. The figure shows the RGB input together with predictions from the three probe families used in the paper: dense depth estimation, segmentation, and 3D detection boxes. It provides a direct qualitative view of the same patch-level capabilities summarized by the quantitative probe tables.
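The planar-fit statistic quoted for Figure 3 can be reproduced, assuming the plotted 2-D t-SNE coordinates and per-patch depth values are available as arrays, by regressing depth on the coordinates and reporting the coefficient of determination of the fitted plane. The sketch below is not the paper's analysis script; it only illustrates one such computation in NumPy.

import numpy as np

def planar_fit_r2(tsne_xy: np.ndarray, depth: np.ndarray) -> float:
    """tsne_xy: (N, 2) t-SNE coordinates of the patches; depth: (N,) per-patch depth."""
    X = np.column_stack([tsne_xy, np.ones(len(depth))])  # design matrix [x, y, 1]
    coef, *_ = np.linalg.lstsq(X, depth, rcond=None)      # least-squares plane a*x + b*y + c
    residual = depth - X @ coef
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((depth - depth.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Example with synthetic values; real inputs would be the plotted embeddings and depths.
rng = np.random.default_rng(0)
xy = rng.normal(size=(500, 2))
d = 2.0 * xy[:, 0] - 0.5 * xy[:, 1] + rng.normal(scale=1.0, size=500)
print(round(planar_fit_r2(xy, d), 3))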