Decoupling Common and Unique Representations for Multimodal Self-supervised Learning
Yi Wang, Conrad M Albrecht, Nassim Ait Ali Braham, Chenying Liu, Zhitong Xiong, Xiao Xiang Zhu
TL;DR
DeCUR addresses a limitation of multimodal self-supervised learning: most methods emphasize cross-modal alignment at the expense of modality-unique information. It decouples embeddings into cross-modal common and modality-unique components, applies cross- and intra-modal redundancy reduction, and adds deformable attention to focus on modality-informative regions. The approach yields consistent improvements across SAR-optical, RGB-DEM, and RGB-depth tasks, in both multimodal transfer and modality-missing scenarios, with strong gains on classification and segmentation benchmarks. Its main limitations are a fixed common-unique ratio and the need for a grid search to find the best split; future work could explore adaptive decoupling and extension to more than two modalities. Overall, DeCUR demonstrates the potential of modality-aware representation learning for robust, transferable multimodal understanding.
Abstract
The increasing availability of multi-sensor data has sparked wide interest in multimodal self-supervised learning. However, most existing approaches learn only common representations across modalities while ignoring intra-modal training and modality-unique representations. We propose Decoupling Common and Unique Representations (DeCUR), a simple yet effective method for multimodal self-supervised learning. By distinguishing inter- and intra-modal embeddings through multimodal redundancy reduction, DeCUR can integrate complementary information across different modalities. We evaluate DeCUR in three common multimodal scenarios (radar-optical, RGB-elevation, and RGB-depth), and demonstrate consistent improvements regardless of architecture, in both multimodal and modality-missing settings. With thorough experiments and comprehensive analysis, we hope this work can provide valuable insights and raise more interest in researching the hidden relationships of multimodal representations.
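
To make the decoupling idea concrete, the following is a minimal NumPy sketch of one plausible reading of "multimodal redundancy reduction", written in the style of a Barlow Twins cross-correlation objective. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `n_common` split parameter, the weight `lam`, and the exact form of each loss term are all hypothetical and are not given in the text above. The sketch assumes that, across modalities, the common dimensions are pushed toward correlation 1 and the unique dimensions toward correlation 0, while within a modality two augmented views are aligned over all dimensions.

```python
import numpy as np

def cross_correlation(za, zb, eps=1e-9):
    """Cross-correlation matrix (D x D) of two embedding batches of shape (N, D)."""
    za = (za - za.mean(axis=0)) / (za.std(axis=0) + eps)
    zb = (zb - zb.mean(axis=0)) / (zb.std(axis=0) + eps)
    return za.T @ zb / za.shape[0]

def off_diagonal_sq_sum(c):
    """Sum of squared off-diagonal entries of a square matrix."""
    return (c ** 2).sum() - (np.diag(c) ** 2).sum()

def cross_modal_loss(z1, z2, n_common, lam=0.005):
    """Assumed cross-modal term: z1 and z2 are embeddings of two modalities.

    The first n_common dimensions are treated as common (aligned across
    modalities); the remaining ones as unique (decorrelated across modalities).
    """
    c = cross_correlation(z1, z2)
    c_common = c[:n_common, :n_common]
    c_unique = c[n_common:, n_common:]
    # Common block: diagonal toward 1, off-diagonal toward 0.
    loss_common = ((np.diag(c_common) - 1.0) ** 2).sum() + lam * off_diagonal_sq_sum(c_common)
    # Unique block: diagonal toward 0, off-diagonal toward 0.
    loss_unique = (np.diag(c_unique) ** 2).sum() + lam * off_diagonal_sq_sum(c_unique)
    return loss_common + loss_unique

def intra_modal_loss(za, zb, lam=0.005):
    """Assumed intra-modal term: two augmented views of the same modality,
    aligned over all dimensions."""
    c = cross_correlation(za, zb)
    return ((np.diag(c) - 1.0) ** 2).sum() + lam * off_diagonal_sq_sum(c)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shared = rng.normal(size=(512, 4))                             # information present in both modalities
    z1 = np.concatenate([shared, rng.normal(size=(512, 4))], axis=1)  # modality 1: shared + unique part
    z2 = np.concatenate([shared, rng.normal(size=(512, 4))], axis=1)  # modality 2: shared + unique part
    print("cross-modal loss, 4 common dims:", cross_modal_loss(z1, z2, n_common=4))
    print("cross-modal loss, 6 common dims:", cross_modal_loss(z1, z2, n_common=6))
```

In this toy setup the loss is lower when `n_common` matches the number of dimensions that are actually shared, which hints at why a fixed common-unique ratio would need to be tuned, for example by grid search, as the TL;DR notes.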
