
FedMLLM: Federated Fine-tuning MLLM on Multimodal Heterogeneity Data

Binqian Xu, Xiangbo Shu, Haiyang Mei, Guosen Xie, Basura Fernando, Jinhui Tang

TL;DR

FedMLLM tackles the challenge of privacy-preserving fine-tuning of Multimodal Large Language Models in the presence of multimodal heterogeneity. It introduces a benchmark and a general FedMLLM framework that combines LoRA-based fine-tuning with two modality-agnostic strategies (prompt augmentation and adaptive regularization) to mitigate cross-client modality gaps. Across four multimodal datasets and six baselines, the approach yields improvements over zero-shot and local training, while maintaining affordable communication costs. The work provides a practical pathway for privacy-preserving multimodal adaptation and outlines directions for expanding to additional modalities and cross-device Federated Learning.
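
To give a rough sense of why LoRA-based fine-tuning can keep communication affordable, the sketch below compares the number of values a client would upload for one adapted weight matrix against the full matrix. It relies only on the standard LoRA parameterization (a rank-r update stored as two factors of shapes r × d_in and d_out × r). The layer size and rank are illustrative assumptions, not values reported by the paper.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the full (d_out x d_in) weight matrix."""
    return d_in * d_out


# Hypothetical layer size and rank, chosen only for illustration.
d_in = d_out = 4096
rank = 8

lora_params = lora_param_count(d_in, d_out, rank)
full_params = full_param_count(d_in, d_out)
ratio = lora_params / full_params

print(f"LoRA adapter parameters: {lora_params}")
print(f"Full matrix parameters:  {full_params}")
print(f"Fraction communicated:   {ratio:.4%}")
```

Under these assumed sizes, a client exchanging only the adapter sends 1/256 of what it would send for the full matrix; the actual saving depends on the model, the rank, and which layers are adapted.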

Abstract

Multimodal Large Language Models (MLLMs) have made significant advancements, demonstrating powerful capabilities in processing and understanding multimodal data. Fine-tuning MLLMs with Federated Learning (FL) allows for expanding the training data scope by including private data sources, thereby enhancing their practical applicability in privacy-sensitive domains. However, current research remains in the early stage, particularly in addressing the multimodal heterogeneities in real-world applications. In this paper, we introduce a benchmark to evaluate the performance of federated fine-tuning of MLLMs across various multimodal heterogeneous scenarios, laying the groundwork for future research in the field. Our benchmark includes two lightweight MLLMs, two downstream tasks, three evaluation metrics, and five datasets across three domains, along with six comparison baselines, covering over ten types of modality heterogeneities across four multimodal scenarios. To address the challenges posed by multimodal heterogeneity, we develop a general FedMLLM framework that integrates classic FL methods alongside two modality-agnostic strategies. Extensive experimental results show that our proposed FL paradigm improves the performance of MLLMs by broadening the range of training data and mitigating multimodal heterogeneity. Code is available in supplementary materials.
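
As a minimal sketch of what a "classic FL method" applied to adapter weights can look like, the snippet below performs FedAvg-style aggregation: the server averages each client's LoRA parameters, weighted by local sample count. The abstract does not name the specific algorithms used, so FedAvg here is an assumption, and the `Adapter` type, the `fedavg` function, and the parameter name `lora_A` are hypothetical names introduced for illustration. Parameters are flattened to plain lists so the example has no dependencies.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flattened adapter weights.
Adapter = Dict[str, List[float]]


def fedavg(client_adapters: List[Adapter], client_sizes: List[int]) -> Adapter:
    """Average client adapters, weighting each client by its sample count."""
    if not client_adapters or len(client_adapters) != len(client_sizes):
        raise ValueError("need one sample count per client adapter")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    aggregated: Adapter = {}
    for name, reference in client_adapters[0].items():
        merged = [0.0] * len(reference)
        for adapter, size in zip(client_adapters, client_sizes):
            weight = size / total  # this client's share of all samples
            for i, value in enumerate(adapter[name]):
                merged[i] += weight * value
        aggregated[name] = merged
    return aggregated


# Two hypothetical clients holding 100 and 300 local samples.
client_a = {"lora_A": [1.0, 2.0]}
client_b = {"lora_A": [5.0, 6.0]}
global_adapter = fedavg([client_a, client_b], [100, 300])
print(global_adapter)
```

Only the adapter weights leave each client in this scheme; raw images and text stay local. The paper's two modality-agnostic strategies are not represented in this sketch.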

Paper Structure

This paper contains 14 sections, 3 equations, 14 figures, 14 tables.

Figures (14)

  • Figure 1: MLLMs$+$FL training on decentralized multimodal data, including both aligned and non-aligned modality training, where non-aligned modality data across clients contain multimodal heterogeneity compared to aligned modality data.
  • Figure 2: Overview of FedMLLM and its performance. FedMLLM fully deploys over ten multimodal heterogeneities, four classic FL algorithms, two lightweight MLLMs, and two modality-agnostic strategies, and supports six comparison baselines, five private datasets, four multimodal scenarios, three evaluation metrics, and two downstream tasks. The multimodal scenario results come from federated fine-tuning of an MLLM on the Hateful-Memes (top) and CrisisMMD (bottom) datasets across twelve multimodal scenarios.
  • Figure 3: Visualization of modality counts and types across four multimodal scenarios on the Hateful-Memes dataset. The horizontal axis represents clients, and the vertical axis shows the sample count. Differences in modality across clients illustrate multimodal heterogeneity.
  • Figure 4: The t-SNE visualization of embeddings across clients from multimodal scenarios, with different colors denoting various clients.
  • Figure 5: Visualization of sample counts across four multimodal scenarios in the four datasets. Data volume differences across clients exacerbate heterogeneity challenges, particularly in the Cross and Hybrid Modal scenarios constructed from MedAlpaca and VQA-RAD.
  • ...and 9 more figures