MLLM-LLaVA-FL: Multimodal Large Language Model Assisted Federated Learning

Jianyi Zhang, Hao Frank Yang, Ang Li, Xin Guo, Pu Wang, Haiming Wang, Yiran Chen, Hai Li

TL;DR

Federated learning faces data heterogeneity and long-tailed class distributions, which degrade performance and impair fairness. The authors propose MLLM-LLaVA-FL, a three-stage framework that leverages server-side multimodal large language models for global multimodal pretraining, federated finetuning, and global alignment, while keeping client computation light and preserving privacy. A Dynamic Weighted Distillation mechanism combines features from a compact, trainable FL model with a frozen CLIP encoder, and a global alignment loss $L_{align} = L_{ce}(y,p) + \beta \cdot KL(q \| p)$ mitigates long-tail bias using a balanced alignment dataset. Empirical results on CIFAR-10-LT, CIFAR-100-LT, and ImageNet-LT show gains over baselines, especially for minority classes, demonstrating improved robustness and practical advantages such as reduced client burden and enhanced privacy.
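As a rough illustration of the alignment objective $L_{align} = L_{ce}(y,p) + \beta \cdot KL(q \| p)$, the following NumPy sketch evaluates the loss on a small batch. It assumes, beyond what the summary states, that $p$ is the softmax output of the aggregated FL model and that $q$ is a soft target distribution supplied on the server side. The function names, the `eps` smoothing constant, and the batch-mean reduction are illustrative choices, not the authors' implementation.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def alignment_loss(logits, labels, q, beta=1.0, eps=1e-12):
    """Compute L_ce(y, p) + beta * KL(q || p), averaged over the batch.

    logits: (N, C) raw scores from the model being aligned (assumed source of p)
    labels: (N,) integer class labels y
    q:      (N, C) target distributions (assumed to come from the server-side teacher)
    beta:   weight on the KL term
    eps:    small constant to avoid log(0); a hypothetical smoothing choice
    """
    p = softmax(logits)
    n = logits.shape[0]
    # Cross-entropy between the hard labels y and the predictions p.
    ce = -np.log(p[np.arange(n), labels] + eps).mean()
    # KL(q || p) = sum_c q_c * (log q_c - log p_c), computed per sample.
    kl = (q * (np.log(q + eps) - np.log(p + eps))).sum(axis=-1).mean()
    return ce + beta * kl


if __name__ == "__main__":
    logits = np.zeros((2, 4))      # uniform predictions over 4 classes
    labels = np.array([0, 3])
    q_uniform = np.full((2, 4), 0.25)
    # With q equal to p, the KL term is zero and only cross-entropy remains.
    print(alignment_loss(logits, labels, q_uniform, beta=0.5))
```

When $q$ matches $p$ the KL term vanishes and the loss reduces to plain cross-entropy, so $\beta$ only matters where the model's predictions disagree with the target distribution.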

Abstract

Previous studies on federated learning (FL) often encounter performance degradation due to data heterogeneity among different clients. Recent multimodal large language models (MLLMs), such as GPT-4v and LLaVA, demonstrate exceptional proficiency in multimodal tasks such as image captioning and multimodal question answering. In light of these advances, we introduce a novel federated learning framework, named Multimodal Large Language Model Assisted Federated Learning (MLLM-LLaVA-FL), which employs powerful MLLMs at the server end to address the challenges of heterogeneous and long-tailed data. Owing to the advanced cross-modality representation capabilities and the extensive open-vocabulary prior knowledge of MLLMs, our framework can harness both the extensive, yet previously underexploited, open-source data accessible from websites and powerful server-side computational resources. Hence, MLLM-LLaVA-FL not only enhances performance but also avoids increasing the risk of privacy leakage and the computational burden on local devices, distinguishing it from prior methodologies. Our framework has three key stages. First, we conduct global visual-text pretraining of the model, using the extensive open-source data available online with the assistance of MLLMs. Next, the pretrained model is distributed among the clients for local training. Finally, once the locally trained models are transmitted back to the server, a global alignment is carried out under the supervision of MLLMs to further enhance performance. Experimental evaluations on established benchmarks show that our framework delivers promising performance in typical FL scenarios with data heterogeneity and long-tailed distributions across clients.

Paper Structure (24 sections, 6 equations, 4 figures, 4 tables)


Figures (4)

  • Figure 1: The workflow of MLLM-LLaVA-FL. The MLLMs are used on the server side in the first stage (Global Multimodal Pretraining) and the third stage (Global Alignment), which avoids extra computational load on devices.
  • Figure 2: The visualization of our pretraining mechanism.
  • Figure 3: The comparative analysis of pretrained and non-pretrained models using 1% subsets of CIFAR-10/100 training data.
  • Figure 4: The comparative analysis of aligned and non-aligned models with normalized confusion matrices.