HyPCA-Net: Advancing Multimodal Fusion in Medical Image Analysis

J. Dhar, M. K. Pandey, D. Chakladar, M. Haghighat, A. Alavi, S. Mistry, N. Zaidi

TL;DR

HyPCA-Net targets two weaknesses of multimodal fusion, heavy computation and information loss, with a two-phase design: RMIL, which learns robust shared representations, and MML, which performs multi-disease classification. It integrates a Residual Adaptive Learning Attention (RALA) block and a Dual-View Cascaded Attention (DVCA) block. RALA is built from Spatial–Channel convolution Adaptive Learning Attention (SCALA) blocks, each combining Multi-scale Spatial Heterogeneous Convolution (MSHC) with SCPFA; DVCA combines Hybrid Space Fusion Attention (Hy-SFA) with Multi-scale Multi-frequency Mutual Update Attention (MMMUA) to fuse information across token and frequency spaces. Evaluations on ten datasets show improvements of up to $5.2$ percentage points and reductions in computation of up to $73.1\%$ relative to state-of-the-art (SOTA) methods. The results indicate strong generalization across modalities and diseases, with a scalable design suited to low-resource environments.

Abstract

Multimodal fusion frameworks, which integrate diverse medical imaging modalities (e.g., MRI, CT), have shown great potential in applications such as skin cancer detection, dementia diagnosis, and brain tumor prediction. However, existing multimodal fusion methods face significant challenges. First, they often rely on computationally expensive models, limiting their applicability in low-resource environments. Second, they often employ cascaded attention modules, which potentially increase the risk of information loss during inter-module transitions and hinder their capacity to capture robust shared representations across modalities. This restricts their generalization in multi-disease analysis tasks. To address these limitations, we propose a Hybrid Parallel-Fusion Cascaded Attention Network (HyPCA-Net), composed of two novel core blocks: (a) a computationally efficient residual adaptive learning attention block for capturing refined modality-specific representations, and (b) a dual-view cascaded attention block aimed at learning robust shared representations across diverse modalities. Extensive experiments on ten publicly available datasets show that HyPCA-Net significantly outperforms existing leading methods, with improvements of up to 5.2% in performance and reductions of up to 73.1% in computational cost. Code: https://github.com/misti1203/HyPCA-Net.
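
To make the two headline numbers concrete, the short sketch below works through the arithmetic. It follows the TL;DR in reading the 5.2 figure as percentage points. The baseline score of 90.0 is a hypothetical value chosen only for illustration and is not a result from the paper, and the function names are likewise illustrative.

```python
def remaining_fraction(reduction_pct):
    """Fraction of the original compute left after a given percent reduction."""
    return 1.0 - reduction_pct / 100.0


def cost_ratio(reduction_pct):
    """How many times cheaper the reduced model is than the baseline."""
    return 1.0 / remaining_fraction(reduction_pct)


def relative_gain_pct(baseline_score, gain_points):
    """Relative improvement (in percent) implied by an absolute gain in points."""
    return 100.0 * gain_points / baseline_score


# 73.1% is the maximum reduction reported in the abstract.
left = remaining_fraction(73.1)   # about 0.269 of the baseline compute
ratio = cost_ratio(73.1)          # about 3.7 times cheaper

# HYPOTHETICAL baseline of 90.0 points, for illustration only.
rel = relative_gain_pct(90.0, 5.2)  # about 5.8% relative gain

print(f"compute left: {left:.3f}, cost ratio: {ratio:.2f}x, relative gain: {rel:.2f}%")
```

The distinction matters when comparing against other work: a gain of 5.2 percentage points is an absolute difference between two scores, whereas the corresponding relative gain depends on the baseline and is larger for weaker baselines.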

Paper Structure

This paper contains 10 sections, 7 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Overview of the HyPCA-Net framework, comprising two phases: (A) RMIL, which learns robust shared representations $\{x_i^{S}\}_{i=1}^{m}$; and (B) MML, which performs multi-disease classification. Within the RMIL phase, our novel (C) HyPCA block comprises the RALA block (refining unimodal features $\{x'_i\}_{i=1}^{m}$) and the (D) DVCA block (capturing dual-domain multimodal information).
  • Figure 2: (A) Overview of the RALA block (the first main component of HyPCA), which is composed of SCALA blocks (B), each of which consists of an MSHC block (C) and an SCPFA block (D). The SCPFA block is composed of SHIA (E) and CHIA (F) components.
  • Figure 3: Overview of Hy-SFA (a key component of the DVCA block), comprising (A) TFSI and (B) FDCA blocks. FDCA includes Heterogeneous Channel Attention (HCA).
  • Figure 4: (A) Overview of the MMMUA block (a key component of the DVCA block). It consists of MCBI (B), FCIF (C), and SMIF (D) blocks. The MCBI block integrates the outputs of the FCIF and SMIF blocks. FCIF employs a Hierarchical Channel Fusion (HCF) mechanism.
  • Figure 5: Visualization of the important regions highlighted by the proposed HyPCA-Net framework and ten other SOTA methods, obtained with Grad-CAM on two benchmark datasets, D5 and D1.
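
The captions of Figures 1 to 4 describe a nested block hierarchy that is easy to lose track of. The sketch below records only the containment relations stated in those captions as a plain nested dictionary; it says nothing about how the blocks are implemented, and the helper names `render` and `path_to` are illustrative rather than taken from the paper or its code.

```python
# Block hierarchy as stated in the captions of Figures 1-4.
# Nesting means "is composed of" / "includes"; an empty dict marks a leaf.
# How many SCALA blocks a RALA block stacks is not stated in the captions.
HIERARCHY = {
    "HyPCA": {
        "RALA": {
            "SCALA": {
                "MSHC": {},
                "SCPFA": {"SHIA": {}, "CHIA": {}},
            },
        },
        "DVCA": {
            "Hy-SFA": {"TFSI": {}, "FDCA": {"HCA": {}}},
            "MMMUA": {"MCBI": {}, "FCIF": {"HCF": {}}, "SMIF": {}},
        },
    },
}


def render(tree, depth=0):
    """Return the hierarchy as indented lines, one block name per line."""
    lines = []
    for name, children in tree.items():
        lines.append("  " * depth + name)
        lines.extend(render(children, depth + 1))
    return lines


def path_to(tree, target, prefix=()):
    """Return the chain of enclosing blocks leading to `target`, or None."""
    for name, children in tree.items():
        here = prefix + (name,)
        if name == target:
            return here
        found = path_to(children, target, here)
        if found is not None:
            return found
    return None


print("\n".join(render(HIERARCHY)))
print(" > ".join(path_to(HIERARCHY, "CHIA")))
```

Looking up a component this way shows at a glance where it sits, for example that CHIA belongs to SCPFA inside SCALA on the RALA side, while HCF belongs to FCIF inside MMMUA on the DVCA side.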