FLoRA: Enhancing Vision-Language Models with Parameter-Efficient Federated Learning

Duy Phuong Nguyen; J. Pablo Munoz; Ali Jannesari

TL;DR

This work tackles privacy and scalability in vision-language models by proposing FLoRA, a federated fine-tuning framework that applies Low-Rank Adaptation (LoRA) adapters to CLIP. By updating only the text-encoder LoRA parameters and aggregating with FedAvg-like server updates, FLoRA achieves substantial communication and memory savings while maintaining or improving accuracy across IID and non-IID settings. Extensive experiments across a wide range of datasets, including few-shot and pathological non-IID scenarios, demonstrate that FLoRA outperforms traditional FL baselines and offers robust, data-efficient performance. The approach delivers practical benefits for privacy-preserving, distributed multimodal learning with significantly reduced training time and bandwidth requirements.

Abstract

In the rapidly evolving field of artificial intelligence, multimodal models, e.g., vision-language models (VLMs) that integrate vision and language, have become pivotal for many applications, ranging from image captioning to multimodal search engines. Among these models, the Contrastive Language-Image Pre-training (CLIP) model has demonstrated remarkable performance in understanding and generating nuanced relationships between text and images. However, the conventional training of such models often requires centralized aggregation of vast datasets, posing significant privacy and data governance challenges. To address these concerns, this paper proposes a novel approach that leverages Federated Learning and parameter-efficient adapters, i.e., Low-Rank Adaptation (LoRA), to train VLMs. This methodology preserves data privacy by training models across decentralized data sources and ensures model adaptability and efficiency through LoRA's parameter-efficient fine-tuning. Our approach speeds up training by up to 34.72 times and uses 2.47 times less memory than full fine-tuning.

Paper Structure (30 sections, 3 equations, 11 figures, 5 tables, 2 algorithms)

Introduction
Related Work
Federated Learning
Vision-Language Models
Contrastive Language Image Pre-training (CLIP)
Federated Learning for Vision-Language Models
Client-Side Training
Server Aggregation
Low-Rank Adaptation for Vision-Language Models
Methodology
Baseline fine-tuning methods
Full fine-tuning (FFT)
Linear classifier (LC)
Fine-tune vision model with linear classifier (VM-LC)
Attention Adapter (AA)
...and 15 more sections

Figures (11)

Figure 1: This schematic illustrates the critical steps in federated learning. In each round of federated learning, clients first download the transferred model from the server and independently train models on their own data (step ➊). The locally trained models are then sent to a central server (step ➋). After that, the server aggregates these models to form an average global model (step ➌), which is subsequently redistributed to the clients for further training or inference (step ➍).
Figure 2: Different methods for local model update. The clients download the transferred model from the server, train locally with the local data, and upload it to the server. FLoRA: only transfer LoRA Adapter; FedAA: only transfer Attention Adapter; FedLC: only transfer Linear Classifier; FedVM-LC: only transfer Vision Model and Linear Classifier; FedFFT: transfer the whole CLIP model (full fine-tuning). Our method (FLoRA) is highlighted in the green box. The figure is best viewed in color.
Figure 3: Schematic diagram of the LoRA fine-tuning mechanism. This figure demonstrates the LoRA fine-tuning process applied to a pre-trained model. Input $x$ is processed through low-rank matrices $\boldsymbol{A}$ and $\boldsymbol{B}$, which are the core components of the LoRA approach. The update formed by these matrices is scaled by a factor $\alpha$ and combined with the frozen pre-trained weights $\boldsymbol{W}$, allowing for precise adjustments to the model without the need to retrain the entire network. This targeted fine-tuning strategy leads to efficient adaptation and output generation.
Figure 4: The data distribution of all clients on CIFAR-10 in IID and non-IID setting with varying $\beta$. The size of a circle means the number of data samples.
Figure 5: Learning curve for DTD dataset in different settings.
...and 6 more figures
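
The captions for Figures 1 and 3 describe two mechanisms: a LoRA layer whose low-rank update is scaled by $\alpha$, and a server step that averages the client updates. The following is a minimal sketch of both, not the authors' implementation. The function names, the `alpha / r` scaling convention, the use of plain Python lists, and the sample-size weighting in the average are all assumptions made for illustration.

```python
# Illustrative sketch only. All names, shapes, and conventions here are
# assumptions for exposition, not taken from the FLoRA paper's code.

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """Compute h = W x + (alpha / r) * B (A x).

    W is the frozen pre-trained weight; only the low-rank matrices
    A (r x d_in) and B (d_out x r) would be trained and communicated.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r  # assumed scaling convention
    return [b + scale * u for b, u in zip(base, update)]


def fedavg(client_params, client_sizes):
    """Average one parameter matrix across clients, weighted by sample count."""
    total = sum(client_sizes)
    rows, cols = len(client_params[0]), len(client_params[0][0])
    return [
        [
            sum(p[i][j] * n for p, n in zip(client_params, client_sizes)) / total
            for j in range(cols)
        ]
        for i in range(rows)
    ]


# Usage: a rank-1 adapter on a 2x2 identity weight.
W = [[1, 0], [0, 1]]
A = [[1, 1]]          # shape (r=1, d_in=2)
B = [[0.5], [0.0]]    # shape (d_out=2, r=1)
h = lora_forward(W, A, B, [1, 2], alpha=2, r=1)

# Usage: the server averages the A matrices uploaded by two clients
# that hold 1 and 3 local samples respectively.
global_A = fedavg([[[1.0, 3.0]], [[3.0, 5.0]]], [1, 3])
```

In this sketch each adapter matrix is averaged on its own. Averaging $\boldsymbol{A}$ and $\boldsymbol{B}$ separately is generally not the same as averaging the products $\boldsymbol{B}\boldsymbol{A}$, so this should be read as one simple aggregation choice rather than the only possible one.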
