4D Multimodal Co-attention Fusion Network with Latent Contrastive Alignment for Alzheimer's Diagnosis

Yuxiang Wei, Yanteng Zhang, Xi Xiao, Tianyang Wang, Xiao Wang, Vince D. Calhoun

TL;DR

This work tackles early Alzheimer's disease diagnosis by fusing 4D fMRI with 3D sMRI to capture complementary dynamic and structural brain information. It introduces M2M-AlignNet, featuring a geometry-aware multi-patch-to-multi-patch (M2M) latent alignment and a latent-as-query co-attention fusion strategy, built on a 4D Swin Transformer backbone. The key innovations are the M2M contrastive loss, defined over patch-wise similarities $S^t \in \mathbb{R}^{C\times C}$ and optimized via $l_{M2M}^{t,(i,j)}$, which enables many-to-many cross-modal alignment with adaptive weights $w^{t,(i,k)}$ derived from a discrepancy function $\mathcal{D}$, and a co-attention mechanism that autonomously discovers fusion patterns through trainable latent queries. Extensive experiments on EHBS, ADNI, and HCP demonstrate improved diagnostic performance and reveal interpretable brain-region correspondences between fMRI and sMRI, validating the framework’s effectiveness for robust multimodal AD biomarkers.
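The many-to-many alignment idea can be illustrated with a small NumPy sketch. It is not the paper's exact $l_{M2M}$: the summary does not give the formula, so the code assumes a soft-target cross-entropy over the patch-wise similarity matrix $S^t$. The weights $w$ here come from a hypothetical stand-in for the discrepancy function $\mathcal{D}$ (a softmax over sMRI self-similarity). The function name, the temperature `tau`, and the L2 normalization are likewise illustrative assumptions.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def m2m_loss_sketch(fmri, smri, tau=0.1):
    """Soft many-to-many contrastive loss at a single time point.

    fmri, smri: arrays of shape (C, d) holding C patch embeddings each.
    Assumption: this is an illustrative loss, not the authors' definition.
    """
    # Normalize patches so dot products are cosine similarities (assumption).
    f = fmri / np.linalg.norm(fmri, axis=1, keepdims=True)
    s = smri / np.linalg.norm(smri, axis=1, keepdims=True)

    # Patch-wise cross-modal similarities, shape (C, C).
    sim = f @ s.T

    # Hypothetical adaptive weights: each row is a distribution over sMRI
    # patches, so one fMRI patch can be pulled toward several sMRI patches.
    w = np.exp(log_softmax(s @ s.T / tau, axis=1))

    # Cross-entropy between the soft targets and the predicted matching.
    log_p = log_softmax(sim / tau, axis=1)
    return float(-(w * log_p).sum(axis=1).mean())
```

Because each row of `w` spreads mass over several sMRI patches, no one-to-one pairing is enforced. Under these assumptions the loss is smallest when the fMRI similarities reproduce the target weights.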

Abstract

Multimodal neuroimaging provides complementary structural and functional insights into both human brain organization and disease-related dynamics. Recent studies demonstrate enhanced diagnostic sensitivity for Alzheimer's disease (AD) through synergistic integration of neuroimaging data (e.g., sMRI, fMRI) with tabular biomarkers such as behavioral and cognitive scores. However, the intrinsic heterogeneity across modalities (e.g., 4D spatiotemporal fMRI dynamics vs. 3D anatomical sMRI structure) presents critical challenges for discriminative feature fusion. To bridge this gap, we propose M2M-AlignNet: a geometry-aware multimodal co-attention network with latent alignment for early AD diagnosis using sMRI and fMRI. At the core of our approach is a multi-patch-to-multi-patch (M2M) contrastive loss function that quantifies and reduces representational discrepancies via geometry-weighted patch correspondence, explicitly aligning fMRI components across brain regions with their sMRI structural substrates without one-to-one constraints. Additionally, we propose a latent-as-query co-attention module to autonomously discover fusion patterns, circumventing modality prioritization biases while minimizing feature redundancy. We conduct extensive experiments to confirm the effectiveness of our method and highlight the correspondence between fMRI and sMRI as AD biomarkers.
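The latent-as-query idea can be sketched in a few lines of NumPy, under stated assumptions. A shared set of latent vectors acts as the queries and attends to the fMRI tokens and the sMRI tokens separately, so neither modality is chosen as the query side. The function names, the single-head attention without learned projections, and the additive fusion are simplifications made here, not details taken from the paper. In a real model the latents would be trainable parameters.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(latents, tokens):
    """Single-head cross-attention with the latents as queries.

    latents: (L, d) query vectors; tokens: (N, d) keys and values.
    Learned projection matrices are omitted for brevity (assumption).
    """
    d = latents.shape[-1]
    scores = latents @ tokens.T / np.sqrt(d)  # (L, N)
    attn = softmax(scores, axis=-1)           # each row sums to 1
    return attn @ tokens, attn


def latent_co_attention_sketch(latents, fmri_tokens, smri_tokens):
    """Fuse two modalities through a shared set of latent queries."""
    f_ctx, f_attn = cross_attend(latents, fmri_tokens)
    s_ctx, s_attn = cross_attend(latents, smri_tokens)
    # Additive residual fusion (assumption); the attention maps are
    # returned so they can be inspected per modality.
    fused = latents + f_ctx + s_ctx
    return fused, f_attn, s_attn
```

The returned attention maps play a role loosely analogous to the co-attention scores visualized in the paper's figures, although how those scores are actually computed is not specified in this summary.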

Paper Structure

This paper contains 17 sections, 7 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: M2M-AlignNet: Modalities are first encoded by the corresponding modality-specific encoders, then fused via co-attention. fMRI and sMRI representations are further aligned in the latent space via M2M contrastive loss.
  • Figure 2: M2M contrastive loss to align pairs of fMRI patches with sMRI patches at each time point. Multiple fMRI patches can be aligned with multiple sMRI patches.
  • Figure 3: Visualizations of the key brain regions contributing to the framework. We compute the spatial co-attention scores for fMRI and sMRI, map them onto the atlas, then apply a 95% threshold for better visualization. At the bottom, we show the top five regions with the highest scores. Note that "CN" denotes healthy controls and "AD" denotes patients.
  • Figure 4: Visualizations of the top 3 brain states contributing to the diagnosis. We compute the temporal latent co-attention scores for fMRI and calculate the functional connectivity for 3 brain states that have the highest scores. Here, "SC" stands for subcortical network, "AUD" stands for auditory network, "SM" stands for sensorimotor network, "VIS" stands for visual network, "CC" stands for cognitive-control network, "DM" stands for default-mode network, and "CB" stands for cerebellar network.
  • Figure 5: t-SNE visualizations of fMRI and sMRI embeddings in the latent space. The proposed dot-product M2M contrastive alignment produces more concentrated embeddings, with fMRI and sMRI distributions appearing nearly "orthogonal" to each other, indicating effective alignment. In contrast, without alignment or when using JSD for self-weighting, the embeddings show no significant distributional differences.