HyPCA-Net: Advancing Multimodal Fusion in Medical Image Analysis
J. Dhar, M. K. Pandey, D. Chakladar, M. Haghighat, A. Alavi, S. Mistry, N. Zaidi
TL;DR
HyPCA-Net addresses two problems in multimodal fusion, high computational cost and information loss, by introducing robust multimodal learning in RMIL and multitask fusion in MML. It integrates two blocks: a Residual Adaptive Learning Attention (RALA) block and a Dual-View Cascaded Attention (DVCA) block. RALA uses a Spatial–Channel convolution Adaptive Learning Attention (SCALA) with Multi-scale Spatial Heterogeneous Convolution (MSHC) and SCPFA. DVCA combines Hybrid Space Fusion Attention (Hy-SFA) with Multi-scale Multi-frequency Mutual Update Attention (MMMUA) to fuse information across token and frequency spaces. Evaluations on ten datasets show that HyPCA-Net achieves an improvement of up to $5.2$ percentage points and a reduction of up to $73.1\%$ in computation relative to state-of-the-art (SOTA) methods. The results indicate strong generalization across modalities and diseases, with a scalable design suited to low-resource environments.
Abstract
Multimodal fusion frameworks, which integrate diverse medical imaging modalities (e.g., MRI, CT), have shown great potential in applications such as skin cancer detection, dementia diagnosis, and brain tumor prediction. However, existing multimodal fusion methods face two significant challenges. First, they often rely on computationally expensive models, which limits their applicability in low-resource environments. Second, they often employ cascaded attention modules, which can increase the risk of information loss during inter-module transitions and hinder the capture of robust shared representations across modalities. This restricts their generalization in multi-disease analysis tasks. To address these limitations, we propose a Hybrid Parallel-Fusion Cascaded Attention Network (HyPCA-Net), composed of two novel core blocks: (a) a computationally efficient residual adaptive learning attention block for capturing refined modality-specific representations, and (b) a dual-view cascaded attention block for learning robust shared representations across diverse modalities. Extensive experiments on ten publicly available datasets show that HyPCA-Net significantly outperforms existing leading methods, with improvements of up to 5.2% in performance and reductions of up to 73.1% in computational cost. Code: https://github.com/misti1203/HyPCA-Net.
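The two headline figures are different kinds of quantity: the TL;DR describes the performance gain in percentage points (an absolute difference between two scores), while the cost figure is a reduction relative to a baseline. The sketch below shows how each would be computed. The function names and all input numbers are hypothetical, chosen only so that the arithmetic lands on the reported magnitudes; they are not measurements from the paper, and the cost unit (GFLOPs) is an assumption.

```python
def percentage_point_gain(baseline_pct: float, new_pct: float) -> float:
    """Absolute difference between two scores that are both given in percent."""
    return new_pct - baseline_pct


def relative_reduction_pct(baseline_cost: float, new_cost: float) -> float:
    """Reduction of a cost, expressed as a percentage of the baseline cost."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return 100.0 * (baseline_cost - new_cost) / baseline_cost


# Hypothetical inputs for illustration only (not results from the paper).
baseline_acc, new_acc = 90.0, 95.2          # accuracy in percent
baseline_gflops, new_gflops = 10.0, 2.69    # assumed cost unit: GFLOPs

gain_pp = percentage_point_gain(baseline_acc, new_acc)
saving = relative_reduction_pct(baseline_gflops, new_gflops)
print(f"gain: {gain_pp:.1f} percentage points")
print(f"compute reduction: {saving:.1f}% of baseline")
```

The two measures are not interchangeable: with these illustrative inputs, a gain of 5.2 percentage points on a 90.0% baseline is a relative improvement of roughly 5.8%, so it matters which convention a reported figure follows.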
