DecAlign: Hierarchical Cross-Modal Alignment for Decoupled Multimodal Representation Learning

Chengxuan Qian, Shuo Xing, Shawn Li, Yue Zhao, Zhengzhong Tu

TL;DR

DecAlign tackles cross-modal heterogeneity by explicitly decoupling modality-heterogeneous and modality-common representations and applying a two-tier alignment: prototype-guided multi-marginal optimal transport for heterogeneous interactions, and latent-space distribution matching with Maximum Mean Discrepancy (MMD) for homogeneous semantics. A multimodal transformer refines high-level cross-modal cues, and modality-specific fusion preserves unimodal detail prior to prediction. Across four benchmarks and multiple metrics, DecAlign consistently outperforms 13 baselines, improving MAE, F1, accuracy, and correlation while maintaining modality uniqueness. The approach offers a principled, scalable path toward robust, interpretable multimodal representations with strong generalization potential.

Abstract

Multimodal representation learning aims to capture both shared and complementary semantic information across multiple modalities. However, the intrinsic heterogeneity of diverse modalities poses substantial challenges to effective cross-modal collaboration and integration. To address this, we introduce DecAlign, a novel hierarchical cross-modal alignment framework designed to decouple multimodal representations into modality-unique (heterogeneous) and modality-common (homogeneous) features. To handle heterogeneity, we employ a prototype-guided optimal transport alignment strategy that leverages Gaussian mixture modeling and multi-marginal transport plans, mitigating distribution discrepancies while preserving modality-unique characteristics. To reinforce homogeneity, we enforce semantic consistency across modalities through latent distribution matching with Maximum Mean Discrepancy (MMD) regularization. Furthermore, we incorporate a multimodal transformer to enhance high-level semantic feature fusion, further reducing cross-modal inconsistencies. Extensive experiments on four widely used multimodal benchmarks demonstrate that DecAlign consistently outperforms existing state-of-the-art methods across five metrics. These results highlight the efficacy of DecAlign in improving cross-modal alignment and semantic consistency while preserving modality-unique features, marking a significant advance in multimodal representation learning. Our project page is at https://taco-group.github.io/DecAlign.
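
To make the Maximum Mean Discrepancy term more concrete, the sketch below estimates squared MMD between two sets of feature vectors. It is an illustration only and not the authors' implementation: the Gaussian RBF kernel, the fixed `bandwidth`, the biased estimator, and the names `rbf_kernel`, `mmd_squared`, `text_feats`, `audio_same`, and `audio_shift` are all assumptions, since the abstract does not specify them.

```python
import numpy as np


def rbf_kernel(x, y, bandwidth=2.0):
    """Gaussian RBF kernel matrix between rows of x (n, d) and y (m, d)."""
    # Pairwise squared Euclidean distances via broadcasting.
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * (x @ y.T)
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * bandwidth**2))


def mmd_squared(x, y, bandwidth=2.0):
    """Biased (V-statistic) estimate of squared MMD between samples x and y."""
    k_xx = rbf_kernel(x, x, bandwidth)
    k_yy = rbf_kernel(y, y, bandwidth)
    k_xy = rbf_kernel(x, y, bandwidth)
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()


# Synthetic stand-ins for modality-common features (assumed shapes and data).
rng = np.random.default_rng(0)
text_feats = rng.normal(0.0, 1.0, size=(200, 8))
audio_same = rng.normal(0.0, 1.0, size=(200, 8))   # same distribution as text
audio_shift = rng.normal(1.0, 1.0, size=(200, 8))  # mean-shifted distribution

mmd_same = mmd_squared(text_feats, audio_same)
mmd_shift = mmd_squared(text_feats, audio_shift)
print(f"MMD^2 (matched):  {mmd_same:.4f}")
print(f"MMD^2 (shifted):  {mmd_shift:.4f}")
```

The estimate is close to zero when the two feature sets follow the same distribution and grows as they drift apart, which is why a term of this kind can serve as a regularizer that pulls the homogeneous features of different modalities toward a shared distribution.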

Paper Structure

This paper contains 27 sections, 15 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: DecAlign achieves superior performance compared to state-of-the-art methods across multiple multimodal benchmarks. The bubble size represents relative model performance, illustrating the trade-off between Acc-2 and Binary F1 Score.
  • Figure 2: The framework of our proposed DecAlign approach, illustrated in a multimodal setting with visual, audio, and language inputs. Modality Feature Encoders first extract unimodal embeddings, which are then decoupled into modality heterogeneous and homogeneous components by modality-unique/common encoders. Heterogeneous features are aligned via optimal transport-based cross-modal prototypes, and homogeneous semantics are aligned through latent space semantics and Maximum Mean Discrepancy-based distribution matching. Heterogeneous features are refined by a multimodal transformer for capturing finer-grained cross-modal interactions, then concatenated with homogeneous features and passed through a fully connected layer for downstream tasks.
  • Figure 3: Comparison of predicted versus ground truth category distributions for four representative models on the CMU-MOSI dataset.
  • Figure 4: Visualization of Ablation Studies. (a)–(d) illustrate the performance comparison across different emotion categories on four benchmarks, (e)–(h) visualize the modality gap between visual and language modalities on the CMU-MOSEI dataset.
  • Figure 5: Hyperparameter sensitivity analysis on CMU-MOSI and CMU-MOSEI in terms of Binary F1 Score.