Decoupling Common and Unique Representations for Multimodal Self-supervised Learning

Yi Wang; Conrad M Albrecht; Nassim Ait Ali Braham; Chenying Liu; Zhitong Xiong; Xiao Xiang Zhu

TL;DR

DeCUR addresses a limitation of multimodal self-supervised learning: existing methods emphasize cross-modal alignment at the expense of modality-unique information. It decouples embeddings into cross-modal common and modality-unique components, applies cross- and intra-modal redundancy reduction, and adds deformable attention to focus on modality-informative regions. The approach yields consistent improvements across SAR-optical, RGB-DEM, and RGB-depth tasks, in both multimodal transfer and modality-missing scenarios, with strong gains on classification and segmentation benchmarks. It does, however, use a fixed common-unique ratio and requires a grid search to find the best split; future work could explore adaptive decoupling and extension to more than two modalities. Overall, DeCUR demonstrates the potential of modality-aware representation learning for robust, transferable multimodal understanding.

Abstract

The increasing availability of multi-sensor data sparks wide interest in multimodal self-supervised learning. However, most existing approaches learn only common representations across modalities while ignoring intra-modal training and modality-unique representations. We propose Decoupling Common and Unique Representations (DeCUR), a simple yet effective method for multimodal self-supervised learning. By distinguishing inter- and intra-modal embeddings through multimodal redundancy reduction, DeCUR can integrate complementary information across different modalities. We evaluate DeCUR in three common multimodal scenarios (radar-optical, RGB-elevation, and RGB-depth), and demonstrate its consistent improvement regardless of architectures and for both multimodal and modality-missing settings. With thorough experiments and comprehensive analysis, we hope this work can provide valuable insights and raise more interest in researching the hidden relationships of multimodal representations.

Paper Structure (49 sections, 7 equations, 8 figures, 7 tables, 2 algorithms)

Introduction
Related work
Self-supervised learning
Multimodal self-supervised learning
Modality decoupling
Methodology
Decoupling common and unique representations
Cross-correlation matrix
Cross-modal representation decoupling
Intra-modal representation enhancing
Deformable attention for modality-informative features
Implementation details
Pretraining datasets
Model architecture
Optimization
...and 34 more sections

Figures (8)

Figure 1: Decoupled common and unique representations across two modalities, visualized with t-SNE (van der Maaten and Hinton, 2008). Each embedding dimension is one data point. Red and blue circles indicate unique features from modalities A and B; red crosses and blue squares indicate common features from A and B. The figure shows that common and unique features from different modalities are well separated in the embedding space, and that the common features of the two modalities overlap well. Best viewed in color and zoomed in.
Figure 2: The structure of DeCUR. $M1$ and $M2$ represent two modalities. Two augmented views from each modality are fed to modality-specific encoders ($E1$, $E2$) and projectors ($P1$, $P2$) to get the embeddings $Z$. For cross-modal embeddings, the dimensions are separated into common and unique parts. The correlation matrix of the common dimensions is optimized to be close to the identity, while that of the unique ones to zero. For intra-modal embeddings, both common and unique dimensions are used to calculate the correlation matrix which is optimized to be close to the identity. DeCUR optionally adds deformable attention (the green shadowed region on the right side) in the last layers of ConvNet encoders to boost modality-informative learning.
Figure 3: Ablation results on the percentage of common dimensions and the projector.
Figure 4: Ablation on the decoupling percentage with different embedding dimensionalities (left: SAR-optical; right: RGB-DEM).
Figure 5: Cross-modal representation alignment (left) and DA visualization (right).
...and 3 more figures
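
The loss structure described in the Figure 2 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the weight `lam`, the choice of which views are paired across modalities, the placement of the common dimensions first in the embedding, and the unweighted sum of the four terms are all assumptions made for illustration. The sketch only shows the idea that, across modalities, the common block of the cross-correlation matrix is pushed toward the identity and the unique block toward zero, while within each modality the full matrix is pushed toward the identity.

```python
import numpy as np


def cross_correlation(z_a, z_b, eps=1e-9):
    """Cross-correlation matrix (D x D) of two embedding batches of shape (N, D)."""
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    return z_a.T @ z_b / z_a.shape[0]


def to_identity_loss(c, lam):
    """Penalize deviation from the identity: diagonal -> 1, off-diagonal -> 0."""
    on_diag = np.sum((np.diag(c) - 1.0) ** 2)
    off_diag = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)
    return on_diag + lam * off_diag


def to_zero_loss(c, lam):
    """Penalize deviation from zero: diagonal -> 0, off-diagonal -> 0."""
    on_diag = np.sum(np.diag(c) ** 2)
    off_diag = np.sum(c ** 2) - on_diag
    return on_diag + lam * off_diag


def decur_style_loss(z1_a, z1_b, z2_a, z2_b, n_common, lam=0.005):
    """Hypothetical combined loss for two modalities with two views each.

    z1_a, z1_b: two augmented views of modality 1, shape (N, D).
    z2_a, z2_b: two augmented views of modality 2, shape (N, D).
    n_common:   number of leading dimensions treated as common (assumed layout).
    """
    # Cross-modal term: split the correlation matrix into common and unique blocks.
    c_cross = cross_correlation(z1_a, z2_a)
    loss_common = to_identity_loss(c_cross[:n_common, :n_common], lam)
    loss_unique = to_zero_loss(c_cross[n_common:, n_common:], lam)

    # Intra-modal terms: all dimensions, matrix pushed toward the identity.
    loss_m1 = to_identity_loss(cross_correlation(z1_a, z1_b), lam)
    loss_m2 = to_identity_loss(cross_correlation(z2_a, z2_b), lam)

    return loss_common + loss_unique + loss_m1 + loss_m2
```

In this sketch, `n_common` plays the role of the fixed common-unique ratio mentioned in the TL;DR: it is a hyperparameter that would have to be chosen by search rather than learned.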
