Multi-Modal One-Shot Federated Ensemble Learning for Medical Data with Vision Large Language Model

Naibo Wang; Yuchen Deng; Shichen Fan; Jianwei Yin; See-Kiong Ng

TL;DR

This work tackles privacy-preserving medical image analysis under limited communication by proposing FedMME, a one-shot, multi-modal federated ensemble framework. On each client, FedMME extracts visual features with a conventional vision model, generates a textual report for each image with a vision large language model, and extracts textual features from that report with BERT. The textual features are reduced in dimensionality, fused with the visual features, and passed to a fully connected classifier; the server then combines the client models by ensemble voting. Across four medical datasets under non-IID Dirichlet partitions, FedMME outperforms existing one-shot FL baselines, by more than 17.5% in accuracy on RSNA (Dirichlet $\alpha = 0.3$), and performs robustly on the Diabetic Retinopathy dataset. Because training and aggregation take a single communication round, the approach reduces communication overhead while improving diagnostic accuracy, which highlights the potential of vision large language models in federated medical AI.
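The non-IID Dirichlet partitions mentioned above can be illustrated with a minimal sketch. It assumes the common label-skew recipe in which, for each class, a symmetric Dirichlet draw decides how that class is shared among clients; the paper's exact partitioning procedure is not stated here, and the names `sample_dirichlet` and `dirichlet_partition` are hypothetical.

```python
import random
from collections import defaultdict


def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k entries.

    Uses the standard construction: normalise k independent Gamma(alpha, 1) draws.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients, one class at a time (assumed recipe).

    For each class, a Dirichlet(alpha) draw sets the share of that class
    each client receives. Smaller alpha gives more skewed shares.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    clients = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        shares = sample_dirichlet(alpha, num_clients, rng)
        start, cumulative = 0, 0.0
        for c in range(num_clients):
            cumulative += shares[c]
            if c == num_clients - 1:
                end = len(idxs)  # last client takes whatever remains
            else:
                end = min(len(idxs), round(cumulative * len(idxs)))
            clients[c].extend(idxs[start:end])
            start = end
    return clients


# Toy usage: 200 samples, 2 classes, 5 clients, alpha = 0.3.
toy_labels = [i % 2 for i in range(200)]
parts = dirichlet_partition(toy_labels, num_clients=5, alpha=0.3)
print([len(p) for p in parts])
```

With a small value such as $\alpha = 0.3$, some clients typically end up with most of one class and little of the other, which is the kind of heterogeneity the experiments are described as covering.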

Abstract

Federated learning (FL) has attracted considerable interest in the medical domain because it enables collaborative model training while keeping data private. However, conventional FL methods typically require multiple communication rounds, which leads to significant communication overhead and delays, especially in environments with limited bandwidth. One-shot federated learning addresses these issues by conducting model training and aggregation in a single communication round, thereby reducing communication costs while preserving privacy. Within this family, one-shot federated ensemble learning combines independently trained client models using ensemble techniques such as voting, which further boosts performance in non-IID data scenarios. Separately, existing machine learning methods in healthcare predominantly use unimodal data (e.g., medical images or textual reports), which restricts their diagnostic accuracy and comprehensiveness; integrating multi-modal data has been proposed to address this shortcoming. In this paper, we introduce FedMME, a one-shot multi-modal federated ensemble learning framework that uses multi-modal data for medical image analysis. Specifically, FedMME uses vision large language models to produce textual reports from medical images, employs a BERT model to extract textual features from these reports, and fuses these features with visual features to improve diagnostic accuracy. Experimental results show that our method outperforms existing one-shot federated learning methods in healthcare scenarios across four datasets with various data distributions. For instance, it surpasses existing one-shot federated learning approaches by more than 17.5% in accuracy on the RSNA dataset when the data are partitioned with a Dirichlet distribution with $\alpha = 0.3$.
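The server-side ensemble step described in the abstract can be sketched as follows. This is a minimal illustration assuming hard majority voting over client predictions; the abstract only says "voting", so the voting rule, the tie-break, and the names `majority_vote` and `ensemble_predict` are assumptions rather than the authors' implementation.

```python
from collections import Counter


def majority_vote(client_predictions):
    """Return the label predicted by the most clients for one sample.

    Ties are broken by taking the smallest label, an arbitrary choice
    made only for this sketch.
    """
    counts = Counter(client_predictions)
    best = max(counts.values())
    return min(label for label, n in counts.items() if n == best)


def ensemble_predict(client_models, sample):
    """Query every client model once and combine the answers by voting.

    Each entry of client_models is any callable mapping a sample to a label,
    standing in for a model uploaded in the single communication round.
    """
    return majority_vote([model(sample) for model in client_models])


# Toy usage with three stand-in "models" that threshold a scalar input.
models = [lambda x: int(x > 0.3), lambda x: int(x > 0.5), lambda x: int(x > 0.7)]
print(ensemble_predict(models, 0.6))
```

Because the server only needs each client's trained model once, no further rounds of parameter exchange are required after the upload.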

Paper Structure (17 sections, 7 equations, 8 figures, 2 tables, 2 algorithms)

Introduction
Related Work
Methodology
Problem Definition
Proposed Method: FedMME
Feature Extraction
Feature Fusion
Experiments
Experiments Setup
Baselines
Performance Analysis
Ablation Studies
Comparative analysis of Vision Large Language Model
Effects of different textual feature size
Training convergence analysis
...and 2 more sections

Figures (8)

Figure 1: Overview of the one-shot federated ensemble framework. (I) Federated learning process: each client trains a local model using its private dataset and transmits the model to the central server. (II) Global ensemble learning: the server aggregates model outputs from all clients using a voting mechanism to produce the final decision.
Figure 2: Overview of our proposed framework, FedMME, which is structured into two principal phases: feature extraction and feature fusion. In the feature extraction phase, visual features are obtained from images using conventional vision models, such as ResNet-18, and accompanying textual reports are produced via a sophisticated vision large language model. Textual features are subsequently extracted from these reports using a model designed for textual feature extraction, such as BERT. During the feature fusion phase, these visual features are combined with dimensionality-reduced textual features to create a cohesive feature representation. This combined representation is then processed through a fully connected layer, which enables classification.
Figure 3: Test accuracy of FedMME with different vision large language models, where Dirichlet parameter $\alpha = 0.3$.
Figure 4: Effect of various textual feature sizes across different datasets, where Dirichlet parameter $\alpha = 0.6$.
Figure 5: Training convergence analysis across training epochs, where Dirichlet parameter $\alpha = 0.6$.
...and 3 more figures
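
The feature fusion phase described in the Figure 2 caption can be sketched in plain Python to make the tensor shapes concrete. Several details are assumptions and not taken from the paper: the textual features are reduced with a single linear projection, the visual feature is 512-dimensional (the pooled output size of ResNet-18), the textual feature is 768-dimensional (the hidden size of BERT-base), the reduced textual size is 64, and there are 2 classes. The weights are random and untrained, and the helper names are hypothetical.

```python
import random


def init_linear(in_dim, out_dim, rng):
    """Create random weights (list of rows) and zero biases for y = W x + b."""
    scale = 1.0 / in_dim ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(x, weights, bias):
    """Apply y = W x + b to a feature vector given as a list of floats."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def fuse_and_classify(visual_feat, text_feat, reducer, classifier):
    """Reduce the textual feature, concatenate with the visual one, classify."""
    reduced_text = linear(text_feat, *reducer)   # assumed linear reduction
    fused = visual_feat + reduced_text           # list concatenation
    logits = linear(fused, *classifier)          # fully connected layer
    return fused, logits


# Assumed sizes for illustration only.
VISUAL_DIM, TEXT_DIM, REDUCED_DIM, NUM_CLASSES = 512, 768, 64, 2

rng = random.Random(0)
reducer = init_linear(TEXT_DIM, REDUCED_DIM, rng)
classifier = init_linear(VISUAL_DIM + REDUCED_DIM, NUM_CLASSES, rng)

visual = [rng.random() for _ in range(VISUAL_DIM)]  # stand-in for vision model output
text = [rng.random() for _ in range(TEXT_DIM)]      # stand-in for BERT output
fused, logits = fuse_and_classify(visual, text, reducer, classifier)
print(len(fused), len(logits))
```

Reducing the textual feature before concatenation keeps it from dominating the fused vector by sheer size, which is presumably why Figure 4 studies the effect of the textual feature size.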
