MLLM-LLaVA-FL: Multimodal Large Language Model Assisted Federated Learning
Jianyi Zhang, Hao Frank Yang, Ang Li, Xin Guo, Pu Wang, Haiming Wang, Yiran Chen, Hai Li
TL;DR
Federated learning faces data heterogeneity and long-tailed class distributions, which degrade performance and impair fairness. The authors propose MLLM-LLaVA-FL, a three-stage framework that leverages server-side multimodal large language models for global multimodal pretraining, federated finetuning, and global alignment, while keeping client computation light and preserving privacy. A Dynamic Weighted Distillation mechanism combines features from a compact, trainable FL model with a frozen CLIP encoder, and a global alignment loss $L_{align} = L_{ce}(y,p) + \beta \cdot KL(q \| p)$ mitigates long-tail bias using a balanced alignment dataset. Empirical results on CIFAR-10-LT, CIFAR-100-LT, and ImageNet-LT show gains over baselines, especially for minority classes, demonstrating improved robustness and practical advantages such as reduced client burden and enhanced privacy.
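To make the global alignment loss in the summary above concrete, the following is a minimal NumPy sketch of $L_{align} = L_{ce}(y,p) + \beta \cdot KL(q \| p)$. This section does not define the symbols, so the sketch assumes that $p$ is the FL model's softmax output, $y$ is the integer class label, and $q$ is a reference distribution supplied by the server-side supervision. The function names `softmax` and `alignment_loss`, the default `beta`, and the `eps` smoothing term are illustrative choices, not the authors' implementation.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def alignment_loss(logits, labels, teacher_probs, beta=1.0, eps=1e-12):
    """Cross-entropy on labels plus beta * KL(q || p), averaged over the batch.

    logits:        (batch, classes) raw scores from the model (assumed source of p)
    labels:        (batch,) integer class labels y
    teacher_probs: (batch, classes) reference distribution q (an assumption here)
    """
    p = softmax(logits)
    n = logits.shape[0]
    # L_ce(y, p): negative log-probability of the true class
    ce = -np.log(p[np.arange(n), labels] + eps).mean()
    # KL(q || p) = sum_c q_c * (log q_c - log p_c)
    kl = (teacher_probs * (np.log(teacher_probs + eps) - np.log(p + eps))).sum(axis=-1).mean()
    return ce + beta * kl
```

When `teacher_probs` equals the model's own prediction, the KL term vanishes and the loss reduces to plain cross-entropy; a larger `beta` pulls the prediction more strongly toward the reference distribution.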
Abstract
Previous studies on federated learning (FL) often encounter performance degradation due to data heterogeneity among different clients. In light of recent advances in multimodal large language models (MLLMs), such as GPT-4v and LLaVA, which demonstrate exceptional proficiency in multimodal tasks such as image captioning and multimodal question answering, we introduce a novel federated learning framework, named Multimodal Large Language Model Assisted Federated Learning (MLLM-LLaVA-FL), which employs powerful MLLMs at the server end to address the heterogeneous and long-tailed challenges. Owing to the advanced cross-modality representation capabilities and the extensive open-vocabulary prior knowledge of MLLMs, our framework is adept at harnessing the extensive, yet previously underexploited, open-source data accessible from websites and powerful server-side computational resources. Hence, MLLM-LLaVA-FL not only enhances performance but also avoids increasing the risk of privacy leakage and the computational burden on local devices, distinguishing it from prior methodologies. Our framework has three key stages. First, we conduct global visual-text pretraining of the model, using the extensive open-source data available online with the assistance of MLLMs. Second, the pretrained model is distributed among the clients for local training. Finally, once the locally trained models are transmitted back to the server, a global alignment is carried out under the supervision of MLLMs to further enhance performance. Experimental evaluations on established benchmarks show that our framework delivers promising performance in typical FL scenarios with data heterogeneity and long-tailed distributions across clients.
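The abstract states that locally trained models are sent back to the server before global alignment, but it does not say how they are combined. As a rough illustration only, the sketch below assumes standard FedAvg-style aggregation, where each client's parameters are averaged with a weight proportional to its local sample count. The function name `aggregate` and the dictionary-of-arrays model format are hypothetical, not taken from the paper.

```python
import numpy as np

def aggregate(client_weights, client_sizes):
    """Sample-weighted average of client parameters (FedAvg-style; an assumption).

    client_weights: list of dicts mapping parameter name -> np.ndarray
    client_sizes:   list of local dataset sizes, one per client
    """
    total = float(sum(client_sizes))
    keys = client_weights[0].keys()
    return {
        k: sum(w[k] * (n / total) for w, n in zip(client_weights, client_sizes))
        for k in keys
    }

# Two hypothetical clients holding 1 and 3 samples respectively
clients = [{"w": np.array([1.0, 3.0])}, {"w": np.array([3.0, 5.0])}]
global_model = aggregate(clients, [1, 3])
```

Under this assumption, the averaged model would then be the starting point for the server-side alignment step described in the abstract.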
