FedVLM: Scalable Personalized Vision-Language Models through Federated Learning

Arkajyoti Mitra, Afia Anjum, Paul Agbaje, Mert Pesé, Habeeb Olufowobi

TL;DR

The paper tackles the challenge of privately and efficiently fine-tuning large vision-language models (VLMs) in federated learning (FL) settings with non-iid data distributions. It introduces FedVLM, a federated LoRA-based fine-tuning framework, and a personalized LoRA variant (pLoRA) that shares only the B matrix globally while each client learns its own A_p matrix for local adaptation. In experiments on the RLAIF-V dataset, FedVLM with pLoRA outperforms standard LoRA and other FL-based baselines, improving client-specific performance by 24.5% over standard LoRA in non-iid settings and converging faster than centralized training. The work targets scalable, privacy-preserving, personalized VLM deployment on edge devices and across distributed environments, with potential extensions to other VLM architectures and broader FL strategies.
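As a rough illustration of the pLoRA idea summarized above, the sketch below assumes the usual LoRA parameterization, in which the adapted weight is W_0 + B A_p. Only B is averaged across clients, while each client keeps its own A_p. The sample-count weighting, the helper names (`matmul`, `aggregate_shared_b`, `client_weight`), and the toy values are assumptions made for illustration; the summary does not specify the aggregation rule.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain matrix product of x (n x m) and y (m x k)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def aggregate_shared_b(client_bs: List[Matrix], num_samples: List[int]) -> Matrix:
    """Weighted average of the clients' B matrices (assumed FedAvg-style rule)."""
    total = sum(num_samples)
    rows, cols = len(client_bs[0]), len(client_bs[0][0])
    return [
        [sum(n * b[i][j] for b, n in zip(client_bs, num_samples)) / total
         for j in range(cols)]
        for i in range(rows)
    ]


def client_weight(w0: Matrix, shared_b: Matrix, a_p: Matrix) -> Matrix:
    """Effective weight for one client: W_0 + B A_p, with A_p kept local."""
    delta = matmul(shared_b, a_p)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(w0, delta)]


# Toy setup: 2 x 2 frozen weight, rank-1 adapters, two clients.
w0 = [[0.0, 0.0], [0.0, 0.0]]
client_bs = [[[1.0], [2.0]], [[3.0], [4.0]]]   # each B is 2 x 1, sent to the server
client_as = [[[1.0, 0.0]], [[0.0, 2.0]]]       # each A_p is 1 x 2, never leaves the client
num_samples = [1, 3]

shared_b = aggregate_shared_b(client_bs, num_samples)
personalized = [client_weight(w0, shared_b, a_p) for a_p in client_as]
```

Because only B is exchanged, the two clients end up with different effective weights even though they receive the same aggregated B.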

Abstract

Vision-language models (VLMs) demonstrate impressive zero-shot and few-shot learning capabilities, making them essential for several downstream tasks. However, fine-tuning these models at scale remains challenging, particularly in federated environments where data is decentralized and non-iid across clients. Existing parameter-efficient tuning methods like LoRA (Low-Rank Adaptation) reduce computational overhead but struggle with heterogeneous client data, leading to suboptimal generalization. To address these challenges, we propose FedVLM, a federated LoRA fine-tuning framework that enables decentralized adaptation of VLMs while preserving model privacy and reducing reliance on centralized training. To further tackle data heterogeneity, we introduce personalized LoRA (pLoRA), which dynamically adapts LoRA parameters to each client's unique data distribution, significantly improving local adaptation while maintaining global model aggregation. Experiments on the RLAIF-V dataset show that pLoRA improves client-specific performance by 24.5% over standard LoRA, demonstrating superior adaptation in non-iid settings. FedVLM provides a scalable and efficient solution for fine-tuning VLMs in federated settings, advancing personalized adaptation in distributed learning scenarios.

Paper Structure

This paper contains 16 sections, 6 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Federated vs. Centralized Performance Analysis: We compare the convergence rates and accuracy of FedVLM and centralized training. FedVLM demonstrates faster convergence and higher accuracy, illustrating its effectiveness in FL environments.
  • Figure 2: Performance Analysis Against SOTA: pLoRA demonstrates substantial performance gains over both standard LoRA and FFA-LoRA, underscoring its effectiveness in FL settings.
  • Figure 3: Performance Analysis Against LoRA: pLoRA demonstrates performance gains over standard LoRA on CIFAR-10, underscoring its effectiveness in FL settings.
  • Figure 4: Performance Comparison Across Clients: We show pLoRA's improvement over SOTA methods for each client in non-iid settings, demonstrating consistent benefits in personalized FL.
  • Figure 5: Comparison of Aggregation Methods: FedProx mitigates data distribution shifts among clients by incorporating a proximal term in local updates, enhancing model stability in federated settings.
  • ...and 2 more figures
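
The proximal term mentioned in the Figure 5 caption can be sketched as follows. In the standard FedProx formulation, each client minimizes its local loss plus (mu / 2) * ||w - w_global||^2, so the local gradient gains an extra mu * (w - w_global) term that pulls the client back toward the global model. The function name `fedprox_step`, the flat parameter list, and the toy values are assumptions for illustration, not the paper's implementation.

```python
from typing import List


def fedprox_step(
    w: List[float],
    grad: List[float],
    w_global: List[float],
    lr: float,
    mu: float,
) -> List[float]:
    """One local gradient step with the FedProx proximal term added."""
    return [
        wi - lr * (gi + mu * (wi - wgi))   # loss gradient plus pull toward w_global
        for wi, gi, wgi in zip(w, grad, w_global)
    ]


# Toy values: zero loss gradient, so only the proximal pull moves the weights.
w_local = [1.0, 2.0]
w_server = [0.0, 0.0]
zero_grad = [0.0, 0.0]

with_prox = fedprox_step(w_local, zero_grad, w_server, lr=0.5, mu=1.0)
without_prox = fedprox_step(w_local, zero_grad, w_server, lr=0.5, mu=0.0)
```

Setting `mu` to 0 recovers a plain gradient step, which is why FedProx is often described as a generalization of FedAvg's local update.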