Enhancing Multimodal Medical Image Classification using Cross-Graph Modal Contrastive Learning

Jun-En Ding, Chien-Chin Hsu, Chi-Hsiang Chu, Shuqiang Wang, Feng Liu

TL;DR

CGMCL presents a dual-graph cross-modal framework that aligns image and non-image medical data in a shared latent space using a graph attention encoder and a cross-graph contrastive loss. By constructing separate modality graphs for imaging and clinical features and introducing an IMFES module, CGMCL balances heterogeneous data distributions while preserving structural relationships. The method demonstrates improved accuracy, interpretability, and robustness on Parkinson’s disease SPECT data and a melanoma multimodal dataset, offering clearer Grad-CAM localization and actionable meta-feature insights. Overall, CGMCL advances multimodal medical classification by enabling cross-modal alignment, structured fusion, and clinically meaningful interpretation with scalable efficiency.

Abstract

The classification of medical images is a pivotal aspect of disease diagnosis, often enhanced by deep learning techniques. However, traditional approaches typically focus on unimodal medical image data, neglecting the integration of diverse non-image patient data. This paper proposes a novel Cross-Graph Modal Contrastive Learning (CGMCL) framework for multimodal structured data from different data domains to improve medical image classification. The model effectively integrates both image and non-image data by constructing cross-modality graphs and leveraging contrastive learning to align multimodal features in a shared latent space. An inter-modality feature scaling module further optimizes the representation learning process by reducing the gap between heterogeneous modalities. The proposed approach is evaluated on two datasets: a Parkinson's disease (PD) dataset and a public melanoma dataset. Results demonstrate that CGMCL outperforms conventional unimodal methods in accuracy, interpretability, and early disease prediction. Additionally, the method shows superior performance in multi-class melanoma classification. The CGMCL framework provides valuable insights into medical image classification while offering improved disease interpretability and predictive capabilities.
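
The alignment step described above can be illustrated with a generic symmetric InfoNCE-style loss between per-patient image embeddings and non-image embeddings. This is a minimal sketch under stated assumptions, not the paper's actual cross-graph contrastive loss: the function names, the temperature value, and the use of random vectors in place of graph-encoder outputs are all hypothetical, and the graph construction and feature scaling module are omitted.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def cross_modal_info_nce(z_img, z_meta, temperature=0.1):
    """Symmetric InfoNCE loss (illustrative, not the paper's exact loss).

    Row i of z_img and row i of z_meta are assumed to belong to the same
    patient and form the positive pair; all other rows act as negatives.
    """
    z_img = l2_normalize(z_img)
    z_meta = l2_normalize(z_meta)
    logits = z_img @ z_meta.T / temperature  # shape (n, n)

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row, scored on the diagonal.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the image-to-meta and meta-to-image directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

# Hypothetical embeddings standing in for the two modality encoders' outputs.
rng = np.random.default_rng(0)
z_img = rng.normal(size=(8, 16))
z_meta_aligned = z_img + 0.01 * rng.normal(size=(8, 16))
z_meta_random = rng.normal(size=(8, 16))

aligned = cross_modal_info_nce(z_img, z_meta_aligned)
shuffled = cross_modal_info_nce(z_img, z_meta_random)
print(aligned < shuffled)
```

In this toy setup the loss is lower when the two modalities' embeddings for the same patient nearly coincide than when they are unrelated, which is the behaviour a shared latent space is meant to encourage.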

Paper Structure

This paper contains 35 sections, 25 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Semi-quantitative parameters were derived from the DaTQUANT package and describe regions of the striatum: the entire striatum, caudate nucleus, entire putamen, anterior putamen, and posterior putamen [takatsu2023dysfunction].
  • Figure 2: Four neural-network multimodal fusion methods. (a) and (b) are conventional, widely used vector-based approaches: (a) uses vector concatenation and (b) uses attention-based modal learning. (c) uses a joint network for feature extraction from diverse modalities. (d) illustrates our proposed cross-graph modal fusion, which incorporates a graph structure.
  • Figure 3: The framework of multimodal cross-graph fusion for constructing a common feature space with contrastive learning.
  • Figure 4: Three subtypes of PD annotation. Early abnormalities typically manifest as unilateral putamen decline affecting the P/C ratio and symmetry, with progression involving AP and caudate until bilateral symmetric reduction; overall status is summarized by S = C + AP + PP.
  • Figure 5: Robustness evaluation of the proposed CGMCL model under Gaussian noise perturbations ($\sigma=0.01$–$0.10$) for three SPECT-based PD subtyping tasks.
  • ...and 8 more figures