Examining Modality Incongruity in Multimodal Federated Learning for Medical Vision and Language-based Disease Detection
Pramit Saha, Divyanshu Mishra, Felix Wagner, Konstantinos Kamnitsas, J. Alison Noble
TL;DR
This work analyzes modality incongruity in multimodal federated learning (MMFL) for medical vision-language disease detection, using chest X-ray images and radiology reports from MIMIC-CXR and NIH Open-I. It shows that, under non-IID data, MMFL with a mix of unimodal and multimodal clients can underperform unimodal federated learning. It then examines three mitigation routes: diverse self-attention fusion masks, a Modality Imputation Network (MIN), and client- and server-level regularization/distillation. Among the methods studied, MIN and the server-assisted LOOT emerge as the most effective at reducing modality gaps and improving performance across several MMFL settings, although the gains depend on data heterogeneity and on the ratio of multimodal clients. The findings offer guidance for deploying MMFL in healthcare settings where modality availability varies across institutions, with implications for privacy-preserving collaboration, model robustness, and equitable access to multimodal diagnostic capabilities.
Abstract
Multimodal Federated Learning (MMFL) utilizes multiple modalities in each client to build a more powerful Federated Learning (FL) model than its unimodal counterpart. However, the impact of modalities missing at different clients, also called modality incongruity, has been largely overlooked. This paper, for the first time, analyses the impact of modality incongruity and reveals its connection with data heterogeneity across participating clients. We particularly inspect whether incongruent MMFL with unimodal and multimodal clients is more beneficial than unimodal FL. Furthermore, we examine three potential routes for addressing this issue. Firstly, we study the effectiveness of various self-attention mechanisms for incongruity-agnostic information fusion in MMFL. Secondly, we introduce a modality imputation network (MIN), pre-trained in a multimodal client, for modality translation in unimodal clients and investigate its potential for mitigating the missing-modality problem. Thirdly, we assess the capability of client-level and server-level regularization techniques to mitigate the effects of modality incongruity. Experiments are conducted under several MMFL settings on two publicly available real-world datasets, MIMIC-CXR and Open-I, which contain chest X-ray images and radiology reports.
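To make the setting concrete, the following is a minimal sketch of an incongruent client pool and of one possible attention mask for fusing image and text tokens. It is an illustration under stated assumptions, not the masks or code evaluated in the paper: the function names `assign_client_modalities` and `fusion_attention_mask`, the choice of report text as the missing modality, and the token layout (image tokens first, then text tokens) are all hypothetical.

```python
import random


def assign_client_modalities(num_clients, multimodal_ratio, seed=0):
    """Label each client as multimodal (image + text) or image-only.

    Assumption for illustration: unimodal clients lack the report text.
    """
    rng = random.Random(seed)
    num_multimodal = round(num_clients * multimodal_ratio)
    ids = list(range(num_clients))
    rng.shuffle(ids)
    multimodal_ids = set(ids[:num_multimodal])
    return [
        ("image", "text") if c in multimodal_ids else ("image",)
        for c in range(num_clients)
    ]


def fusion_attention_mask(num_image_tokens, num_text_tokens, has_text):
    """Build an allow-mask over a joint [image tokens | text tokens] sequence.

    mask[q][k] is True when query token q may attend to key token k.
    If the text is missing, placeholder text tokens are blocked as keys,
    so no token draws information from them. Every row still allows the
    image keys, which avoids an all-blocked row in the softmax.
    """
    total = num_image_tokens + num_text_tokens
    mask = []
    for _query in range(total):
        row = []
        for key in range(total):
            key_is_text = key >= num_image_tokens
            row.append(has_text or not key_is_text)
        mask.append(row)
    return mask


# Hypothetical usage: 10 clients, 30% of which hold both modalities.
clients = assign_client_modalities(num_clients=10, multimodal_ratio=0.3)
masks = [
    fusion_attention_mask(2, 2, has_text="text" in modalities)
    for modalities in clients
]
```

In this sketch a multimodal client receives an all-True mask (full cross-modal attention), while an image-only client receives a mask that hides the text positions. The multimodal client ratio is the knob that the summary above identifies as influencing results.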
