An Interpretable Cross-Attentive Multi-modal MRI Fusion Framework for Schizophrenia Diagnosis

Ziyu Zhou; Anton Orlichenko; Gang Qu; Zening Fu; Vince D Calhoun; Zhengming Ding; Yu-Ping Wang

TL;DR

Schizophrenia diagnosis benefits from multi-modal MRI, yet fMRI and sMRI heterogeneity hinders simple fusion. The authors propose CAMF, a Cross-Attentive Multi-modal Fusion framework that uses self-attention to model intra-modal interactions and cross-attention to capture inter-modal interactions, fused adaptively into a final representation $f_O$ for classification. Training relies on standard cross-entropy loss with the Adam optimizer and He initialization, and Score-CAM provides interpretable saliency maps identifying disease-relevant networks and regions. Across combined COBRE/FBIRN/MPRC data and the BSNIP dataset, CAMF outperforms baselines and yields biomarker-consistent interpretations, highlighting its potential for diagnostic accuracy and mechanistic insight into schizophrenia.

Abstract

Both functional and structural magnetic resonance imaging (fMRI and sMRI) are widely used for the diagnosis of mental disorders. However, combining complementary information from these two modalities is challenging due to their heterogeneity. Many existing methods fall short of capturing the interaction between these modalities, frequently defaulting to a simple combination of latent features. In this paper, we propose a novel Cross-Attentive Multi-modal Fusion framework (CAMF), which aims to capture both intra-modal and inter-modal relationships between fMRI and sMRI, enhancing multi-modal data representation. Specifically, our CAMF framework employs self-attention modules to identify interactions within each modality, while cross-attention modules identify interactions between modalities. Subsequently, our approach optimizes the integration of latent features from both modalities. This approach significantly improves classification accuracy, as demonstrated by our evaluations on two extensive multi-modal brain imaging datasets, where CAMF consistently outperforms existing methods. Furthermore, the gradient-guided Score-CAM is applied to interpret critical functional networks and brain regions involved in schizophrenia. The biomarkers identified by CAMF align with established research, potentially offering new insights into the diagnosis and pathological endophenotypes of schizophrenia.
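
The intra- and inter-modal attention described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the token shapes, the mean pooling, the softmax-normalized fusion weights, and the names `attention`, `camf_style_fusion`, `params`, and `fusion_logits` are all assumptions made for illustration. Self-attention draws queries, keys, and values from one modality; cross-attention draws queries from one modality and keys and values from the other.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query_src, context_src, w_q, w_k, w_v):
    """Scaled dot-product attention.

    Queries come from query_src; keys and values come from context_src.
    Passing the same array twice gives self-attention, passing two
    different modalities gives cross-attention.
    """
    q = query_src @ w_q
    k = context_src @ w_k
    v = context_src @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def camf_style_fusion(f_fmri, f_smri, params, fusion_logits):
    """Hypothetical fusion of four attention branches into one vector.

    f_fmri: (n_f, d) fMRI feature tokens; f_smri: (n_s, d) sMRI tokens.
    params: dict with keys 'sa_f', 'sa_s', 'ca_f', 'ca_s', each a tuple
            (w_q, w_k, w_v) of (d, d) projection matrices.
    fusion_logits: (4,) scores, softmax-normalized into branch weights
                   (an assumed stand-in for the paper's adaptive fusion).
    """
    branches = [
        attention(f_fmri, f_fmri, *params["sa_f"]),  # intra-modal, fMRI
        attention(f_smri, f_smri, *params["sa_s"]),  # intra-modal, sMRI
        attention(f_fmri, f_smri, *params["ca_f"]),  # fMRI attends to sMRI
        attention(f_smri, f_fmri, *params["ca_s"]),  # sMRI attends to fMRI
    ]
    pooled = np.stack([b.mean(axis=0) for b in branches])  # (4, d)
    alpha = softmax(np.asarray(fusion_logits, dtype=float))
    f_o = alpha @ pooled  # fused representation, shape (d,)
    return f_o, alpha

# Toy usage with random features and projections.
rng = np.random.default_rng(0)
d = 8
f_fmri = rng.normal(size=(5, d))
f_smri = rng.normal(size=(7, d))
params = {
    name: tuple(0.1 * rng.normal(size=(d, d)) for _ in range(3))
    for name in ("sa_f", "sa_s", "ca_f", "ca_s")
}
f_o, alpha = camf_style_fusion(f_fmri, f_smri, params, np.zeros(4))
```

In a trained model the projection matrices and fusion scores would be learned jointly with the classifier; here they are random or zero so that only the data flow is visible.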

Paper Structure (26 sections, 8 equations, 3 figures, 3 tables)

Introduction
Methodology
Preliminary and Motivation
Cross-Attentive Multi-modal Fusion
Framework Overview
Feature Extraction from Multi-modal MRIs
Intra- and Inter-Modality Interaction
Adaptive Cross-Modal Fusion
Objective Function and Optimization
Experiments
Dataset
Combined Dataset
Bipolar and Schizophrenia Network for Intermediate Phenotypes (BSNIP)
Data preprocessing
Comparison Experiment
...and 11 more sections

Figures (3)

Figure 1: Overview of the proposed framework. The backbones consist of two CNN modules to extract features from the fMRI and sMRI data. Then two self-attention (SA) modules and two cross-attention (CA) modules fuse the features at the first level. The latent features are then combined by the optimal weights and input to a classifier.
Figure 2: Average saliency map from the FC modality, where the rows and columns in the FC matrix are grouped by different functional brain regions, which are highlighted in various colored boxes. We observe a high correlation within each box.
Figure 3: Voxel-level and region-level Score-CAM saliency maps along the x-, y-, and z-axes. In the region-level saliency maps, the intensity is averaged inside each brain region defined by the automated anatomical labeling (AAL) atlas (Tzourio-Mazoyer et al., 2002). Subfigures (a)-(c) show the voxel-level average saliency maps and subfigures (d)-(f) show the region-level average saliency maps. For each axis, we choose the slice midway between the two poles to visualize the highlighted voxels/regions in the center of the brain.
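
The region-level averaging and mid-slice selection described in the Figure 3 caption can be sketched as follows. This is an illustrative assumption rather than the authors' code: the function names `region_average` and `middle_slices`, the convention that atlas label 0 marks background, and the toy arrays are all hypothetical, and a real AAL atlas would be loaded from an image file instead.

```python
import numpy as np

def region_average(saliency, atlas):
    """Replace each voxel's saliency with the mean over its atlas region.

    saliency: float array of shape (X, Y, Z), voxel-level saliency map.
    atlas: non-negative integer array of the same shape; each value is a
           region label, with 0 assumed to mean background.
    """
    labels = atlas.ravel()
    values = saliency.ravel()
    n_labels = int(labels.max()) + 1
    sums = np.bincount(labels, weights=values, minlength=n_labels)
    counts = np.bincount(labels, minlength=n_labels)
    means = np.divide(sums, counts, out=np.zeros(n_labels), where=counts > 0)
    means[0] = 0.0  # keep background at zero
    return means[atlas]

def middle_slices(volume):
    """Return the central slice along each of the three axes."""
    x, y, z = (s // 2 for s in volume.shape)
    return volume[x, :, :], volume[:, y, :], volume[:, :, z]

# Toy 2x2x2 example with two regions (labels 1 and 2) and background (0).
saliency = np.arange(8, dtype=float).reshape(2, 2, 2)
atlas = np.array([[[0, 1], [1, 2]],
                  [[2, 2], [0, 1]]])
region_map = region_average(saliency, atlas)
slice_x, slice_y, slice_z = middle_slices(region_map)
```

Averaging within regions trades spatial detail for a map that is easier to compare against atlas-based findings in the literature.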
