Leveraging Foundation Models for Multi-modal Federated Learning with Incomplete Modality
Liwei Che, Jiaqi Wang, Xinyue Liu, Fenglong Ma
TL;DR
This work tackles missing modalities in multi-modal federated learning by introducing FedMVP, a four-module framework. On the client side, large pre-trained encoders with frozen parameters perform modality completion and knowledge transfer; on the server side, client models are aggregated using a representation graph built from synthetic data. The approach combines cross-modal generation with prompt augmentation, a joint multi-modal encoder, two knowledge-transfer losses, namely Multi-modal Contrastive Matching (MCM) and Representation Aligned Margin (RAM), and a CKA-based aggregation strategy. Experiments on CUB-200-2011 and Oxford Flower show that FedMVP consistently outperforms the baselines under both IID and non-IID settings and remains robust as the missing-modality ratio increases; ablations show the contribution of each component. Overall, the results suggest that frozen foundation models offer an efficient route to robust multi-modal FL when client data are incomplete, enabling cross-modal learning across distributed silos.
Abstract
Federated learning (FL) has made tremendous progress in providing collaborative training solutions for distributed data silos with privacy guarantees. However, few existing works explore the more realistic scenario in which clients hold multiple data modalities. In this paper, we address a novel challenge in multi-modal federated learning (MFL), modality missing, in which clients may lack some of the modalities in their local datasets. To tackle this problem, we propose a novel multi-modal federated learning method, Federated Multi-modal contrastiVe training with Pre-trained completion (FedMVP), which integrates large-scale pre-trained models to enhance federated training. In the proposed FedMVP framework, each client deploys a large-scale pre-trained model with frozen parameters for modality completion and representation knowledge transfer, enabling efficient and robust local training. On the server side, we use generated data to measure the representation similarity among the uploaded client models on a common basis, and we construct a graph over the clients so that the models are aggregated according to their importance in the system. We demonstrate that the model achieves superior performance on two real-world image-text classification datasets and is robust to the performance degradation caused by missing modalities.
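The server-side step described above can be illustrated with a minimal sketch. It assumes that every uploaded client model is evaluated on the same set of generated samples, that similarity is measured with linear CKA, and that a client's importance is its total similarity to the other clients (its node strength in the similarity graph). The function names `linear_cka`, `similarity_weights`, and `aggregate`, and the node-strength weighting itself, are illustrative assumptions and not necessarily the exact procedure used in FedMVP.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two representation matrices of shape (n_samples, dim)."""
    # Center each feature so the similarity ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(x.T @ y, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def similarity_weights(client_reps):
    """Build the pairwise CKA graph and derive one aggregation weight per client.

    client_reps: list of arrays, each of shape (n_samples, dim_k), obtained by
    running each client model on the SAME generated samples (assumption).
    """
    n = len(client_reps)
    sim = np.ones((n, n))  # self-similarity is 1 on the diagonal
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = linear_cka(client_reps[i], client_reps[j])
    # Assumed importance score: sum of similarities to all other clients.
    strength = sim.sum(axis=1) - 1.0
    total = strength.sum()
    if total <= 0.0:
        # Fall back to uniform weights (single client or no similarity at all).
        return sim, np.full(n, 1.0 / n)
    return sim, strength / total


def aggregate(client_params, weights):
    """Weighted average of client parameter arrays of identical shape."""
    return sum(w * p for w, p in zip(weights, client_params))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the representations, which is why it is a convenient choice for comparing models trained independently on different clients. In practice `aggregate` would be applied to each parameter tensor of the model in turn.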
