Examining Modality Incongruity in Multimodal Federated Learning for Medical Vision and Language-based Disease Detection

Pramit Saha; Divyanshu Mishra; Felix Wagner; Konstantinos Kamnitsas; J. Alison Noble

TL;DR

This work analyzes modality incongruity in multimodal federated learning (MMFL) for medical vision-language disease detection, using chest X-ray images and radiology reports from MIMIC-CXR and NIH Open-I. It shows that, under non-IID data, MMFL with a mix of unimodal and multimodal clients can underperform unimodal federated learning, and it examines three mitigation routes: diverse self-attention fusion masks, a Modality Imputation Network (MIN), and client- and server-level regularization and distillation. Among the methods studied, MIN and the server-level LOOT emerge as the most effective at reducing modality gaps and improving performance across several MMFL settings, with results depending on data heterogeneity and the ratio of multimodal clients. The findings offer guidance for deploying MMFL in healthcare environments where modality availability varies across institutions, with implications for privacy-preserving collaboration, model robustness, and equitable access to multimodal diagnostic capabilities.

Abstract

Multimodal Federated Learning (MMFL) utilizes multiple modalities in each client to build a more powerful Federated Learning (FL) model than its unimodal counterpart. However, the impact of missing modalities in different clients, also called modality incongruity, has been largely overlooked. This paper, for the first time, analyses the impact of modality incongruity and reveals its connection with data heterogeneity across participating clients. We particularly inspect whether incongruent MMFL with unimodal and multimodal clients is more beneficial than unimodal FL. Furthermore, we examine three potential routes of addressing this issue. Firstly, we study the effectiveness of various self-attention mechanisms towards incongruity-agnostic information fusion in MMFL. Secondly, we introduce a modality imputation network (MIN) pre-trained in a multimodal client for modality translation in unimodal clients and investigate its potential towards mitigating the missing modality problem. Thirdly, we assess the capability of client-level and server-level regularization techniques towards mitigating modality incongruity effects. Experiments are conducted under several MMFL settings on two publicly available real-world datasets, MIMIC-CXR and Open-I, with chest X-ray images and radiology reports.
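To make the first route more concrete, the sketch below builds boolean attention masks for the four fusion schemes named in the Figure 2 caption (isolated, causal, partially bidirectional, bidirectional). This excerpt does not define the schemes, so the definitions used here are one plausible reading and should be treated as assumptions: tokens are ordered image-first, "isolated" restricts attention to same-modality tokens, and "partially bidirectional" lets image tokens attend to each other freely while text tokens attend to all image tokens and to earlier text tokens. The function name `fusion_mask` and its parameters are hypothetical.

```python
import numpy as np


def fusion_mask(n_img: int, n_txt: int, scheme: str) -> np.ndarray:
    """Return an (n, n) boolean mask where mask[i, j] is True if token i may attend to token j.

    Assumption: the sequence is [image tokens..., text tokens...].
    The scheme definitions are an illustrative reading, not the paper's formulation.
    """
    n = n_img + n_txt
    is_img = np.arange(n) < n_img                  # True at image positions
    same_modality = is_img[:, None] == is_img[None, :]
    causal = np.tril(np.ones((n, n), dtype=bool))  # token i sees tokens 0..i

    if scheme == "isolated":
        # Each modality attends only to itself; no cross-modal fusion.
        return same_modality
    if scheme == "causal":
        return causal
    if scheme == "partially_bidirectional":
        # Image block is fully connected; text rows keep the causal pattern,
        # which already covers every image token and all earlier text tokens.
        mask = causal.copy()
        mask[:n_img, :n_img] = True
        return mask
    if scheme == "bidirectional":
        return np.ones((n, n), dtype=bool)
    raise ValueError(f"unknown scheme: {scheme}")
```

In a transformer, such a mask would typically be applied by setting the attention logits at `False` positions to a large negative value before the softmax.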

Paper Structure (21 sections, 7 equations, 31 figures, 6 tables)

Introduction
Related Works
Preliminaries and Problem Setup
Modality incongruity in MMFL
Self-attention Mechanisms
Incongruent to pseudo-congruent MMFL
Towards modality-invariance in MMFL
Client-level solutions
Server-level solutions
Performance analysis
Conclusion
Impact Statement
Dataset Details
Implementation Details
Formulation of Self-attention Mechanism
...and 6 more sections

Figures (31)

Figure 1: Overview of problem settings. Here, only 1 out of 4 clients has both modalities, i.e., CXR image and radiology report.
Figure 2: Four self-attention schemes used in multimodal client(s). (a) Isolated (b) Causal (c) Partially Bidirectional (d) Bidirectional.
Figure 3: Modality Imputation Network (MIN) training procedure.
Figure 4: Illustration of client-level solutions in a 3-client FL scenario - one multimodal client ($M$) and two unimodal clients ($U_1$ and $U_2$). (a) shows the model-based regularization technique of FedProx (in blue) and FedMultiProx (in red). The global model $G$ is replaced by $G_u$ in multimodal clients and $M$ in unimodal clients. (b) shows the representation-based regularization technique of MOON (in blue) and MultiMOON (in red). (c) shows the Modality-aware Knowledge Distillation technique (MAD) and MAD+. $M^*, U_1^*, U_2^*$ represent pre-trained models, i.e., the first teacher model $T_1$. $G$ denotes the second teacher model $T_2$ in all the clients for MAD. For MAD+, $G_u$ denotes the second teacher model in the multimodal client and $M$ denotes the second teacher model in the unimodal clients.
Figure 5: Server-level solutions - LOOT vs FedDF
...and 26 more figures
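The model-based regularization described in the Figure 4(a) caption can be sketched as follows. The standard FedProx proximal term penalizes the squared distance between the local weights and the global model $G$; per the caption, FedMultiProx swaps that anchor for $G_u$ in multimodal clients and for $M$ in unimodal clients. This is a minimal sketch under stated assumptions: models are represented as lists of NumPy arrays, the helper names `pick_anchor` and `proximal_penalty` are hypothetical, and what $G_u$ aggregates is not defined in this excerpt, so it is treated as an opaque set of weights.

```python
import numpy as np


def pick_anchor(is_multimodal: bool, G, G_u, M, multi_variant: bool):
    """Choose the model that the local weights are pulled towards.

    FedProx (multi_variant=False): always the global model G.
    FedMultiProx (multi_variant=True), following the Figure 4 caption:
    G_u in multimodal clients, M in unimodal clients.
    """
    if not multi_variant:
        return G
    return G_u if is_multimodal else M


def proximal_penalty(local_w, anchor_w, mu: float = 0.01) -> float:
    """FedProx-style term (mu / 2) * ||w - w_anchor||^2, summed over all tensors."""
    sq_dist = sum(float(np.sum((l - a) ** 2)) for l, a in zip(local_w, anchor_w))
    return 0.5 * mu * sq_dist
```

During local training, a client would add `proximal_penalty(local_w, pick_anchor(...), mu)` to its task loss; the value of `mu` here is a placeholder, not a setting taken from the paper.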
