AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment

Yan Li, Yifei Xing, Xiangyuan Lan, Xin Li, Haifeng Chen, Dongmei Jiang

TL;DR

AlignMamba is an efficient and effective method for multimodal fusion. It introduces a local cross-modal alignment module that explicitly learns token-level correspondences between modalities, and a global cross-modal alignment loss based on Maximum Mean Discrepancy that implicitly enforces consistency between the distributions of different modalities.

Abstract

Cross-modal alignment is crucial for multimodal representation fusion due to the inherent heterogeneity between modalities. While Transformer-based methods have shown promising results in modeling inter-modal relationships, their quadratic computational complexity limits their applicability to long-sequence or large-scale data. Although recent Mamba-based approaches achieve linear complexity, their sequential scanning mechanism poses fundamental challenges in comprehensively modeling cross-modal relationships. To address this limitation, we propose AlignMamba, an efficient and effective method for multimodal fusion. Specifically, grounded in Optimal Transport, we introduce a local cross-modal alignment module that explicitly learns token-level correspondences between different modalities. Moreover, we propose a global cross-modal alignment loss based on Maximum Mean Discrepancy to implicitly enforce the consistency between different modal distributions. Finally, the unimodal representations after local and global alignment are passed to the Mamba backbone for further cross-modal interaction and multimodal fusion. Extensive experiments on complete and incomplete multimodal fusion tasks demonstrate the effectiveness and efficiency of the proposed method.
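
To make the global alignment idea concrete, the following is a minimal NumPy sketch of a squared Maximum Mean Discrepancy (MMD) estimate between two sets of token features. It is an illustration under stated assumptions, not the authors' implementation: the Gaussian kernel, the `bandwidth` parameter, the biased estimator, and the function names `gaussian_kernel` and `mmd_squared` are all choices made here and are not specified in the abstract.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between rows of a (n, d) and b (m, d)."""
    # Pairwise squared Euclidean distances via the expansion |a|^2 + |b|^2 - 2ab.
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-sq_dists / (2.0 * bandwidth**2))


def mmd_squared(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD between samples x (n, d) and y (m, d)."""
    k_xx = gaussian_kernel(x, x, bandwidth).mean()
    k_yy = gaussian_kernel(y, y, bandwidth).mean()
    k_xy = gaussian_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy


# Hypothetical usage: random stand-ins for video and language token features.
rng = np.random.default_rng(0)
video_tokens = rng.normal(size=(50, 16))
language_tokens = rng.normal(size=(30, 16)) + 1.0  # shifted distribution
loss = mmd_squared(video_tokens, language_tokens, bandwidth=4.0)
print(f"MMD^2 estimate: {loss:.4f}")
```

Used as a training loss, a quantity of this kind would be minimised so that the feature distributions of two modalities move closer together. It compares whole distributions rather than individual token pairs, which is why the abstract describes this alignment as implicit.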

Paper Structure

This paper contains 23 sections, 12 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Transformer leverages attention mechanisms to model relationships across different modalities (top left), whereas Mamba struggles to achieve this due to its sequential scanning mechanism (top right). In contrast, the proposed AlignMamba utilizes both local (OT-based) and global (MMD-based) cross-modal alignment information to achieve efficient and effective multimodal fusion (bottom).
  • Figure 2: AlignMamba enhances multimodal Mamba by incorporating token-level alignment and distribution-level alignment, enabling more effective multimodal fusion.
  • Figure 3: GPU memory usage comparison with varying lengths.
  • Figure 4: Inference time comparison with varying lengths.
  • Figure 5: The learned optimal transport plan. We only show the transport plan between video and language modalities for brevity.
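
A transport plan such as the one in Figure 5 can be illustrated with a small entropy-regularised optimal transport solver. The sketch below uses Sinkhorn iterations with uniform marginals and a cosine-distance cost; these are assumptions made for illustration and may differ from the paper's actual formulation. The names `cosine_cost` and `sinkhorn_plan`, and the parameters `epsilon` and `n_iters`, are hypothetical.

```python
import numpy as np


def cosine_cost(x, y):
    """Cost matrix of cosine distances between rows of x (n, d) and y (m, d)."""
    x_n = x / np.linalg.norm(x, axis=1, keepdims=True)
    y_n = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x_n @ y_n.T


def sinkhorn_plan(cost, epsilon=0.5, n_iters=500):
    """Entropy-regularised transport plan between uniform marginals."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass over source tokens (assumption)
    b = np.full(m, 1.0 / m)  # uniform mass over target tokens (assumption)
    kernel = np.exp(-cost / epsilon)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]


# Hypothetical usage: 5 video tokens and 7 language tokens with 16-dim features.
rng = np.random.default_rng(0)
video_tokens = rng.normal(size=(5, 16))
language_tokens = rng.normal(size=(7, 16))
plan = sinkhorn_plan(cosine_cost(video_tokens, language_tokens))

# Row-normalising the plan gives, for each video token, a weighted average
# of language tokens: one common way to use a plan for token-level alignment.
aligned_language = (plan / plan.sum(axis=1, keepdims=True)) @ language_tokens
print(plan.shape, aligned_language.shape)
```

Each entry `plan[i, j]` can be read as how much of video token `i` is matched to language token `j`, which is the kind of token-level correspondence a heatmap of the plan would display.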