Lightweight Unsupervised Federated Learning with Pretrained Vision Language Model

Hao Yan, Yuhong Guo

TL;DR

The paper tackles privacy-preserving federated learning on edge devices that lack labeled data. It introduces FST-CBDG, which freezes the CLIP image encoder, initializes a lightweight linear classifier with class-text prototypes, and uses self-training plus class-balanced Gaussian data generation to refine pseudo-labels. Results on CIFAR-10/100 and CINIC-10 show substantial gains over CLIP zero-shot prediction and performance competitive with supervised FL under limited computation and communication. The method addresses data heterogeneity and resource constraints, offering a practical path for edge-friendly FL with vision-language priors.

Abstract

Federated learning aims to tackle the "isolated data island" problem by training a collective model from physically isolated clients while safeguarding the privacy of users' data. However, supervised federated learning requires each client to label its data for training, which can be both time-consuming and resource-intensive, and may even be impractical for edge devices. Moreover, the training and transmission of deep models strain the computation and communication capabilities of the clients. To address these two inherent challenges in supervised federated learning, we propose a novel lightweight unsupervised federated learning approach that leverages unlabeled data on each client to perform lightweight model training and communication by harnessing pretrained vision-language models, such as CLIP. By capitalizing on the zero-shot prediction capability and the well-trained image encoder of the pretrained CLIP model, we have carefully crafted an efficient and resilient self-training approach. This method refines the initial zero-shot predicted pseudo-labels of unlabeled instances by training only a linear classifier on top of the fixed image encoder. Additionally, to address data heterogeneity within each client, we propose a class-balanced text feature sampling strategy for generating synthetic instances in the feature space to support local training. Experiments are conducted on multiple benchmark datasets. The experimental results demonstrate that our proposed method greatly enhances model performance in comparison to CLIP's zero-shot predictions and even outperforms supervised federated learning benchmark methods given limited computational and communication overhead.
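
To make the local training idea concrete, the following is a minimal NumPy sketch of one client's update, not the authors' implementation. It assumes random unit vectors as stand-ins for the frozen CLIP image and text features, an isotropic Gaussian with a hypothetical standard deviation `sigma` around each class text feature, plain argmax pseudo-labels with no confidence filtering, and a single gradient step on softmax cross-entropy. The names `sample_class_balanced` and `self_training_step`, and the values of `scale`, `lr`, and `n_per_class`, are illustrative choices rather than settings from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length, as is usual for CLIP-style features."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(logits):
    """Row-wise softmax with max subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sample_class_balanced(text_feats, n_per_class, sigma, rng):
    """Draw the same number of synthetic features around every class text feature.

    Assumption: isotropic Gaussian noise with standard deviation `sigma`.
    """
    k, d = text_feats.shape
    noise = rng.normal(0.0, sigma, size=(k, n_per_class, d))
    feats = (text_feats[:, None, :] + noise).reshape(k * n_per_class, d)
    labels = np.repeat(np.arange(k), n_per_class)
    return l2_normalize(feats), labels

def self_training_step(W, image_feats, text_feats, rng,
                       lr=0.5, scale=10.0, n_per_class=8, sigma=0.05):
    """One gradient step on pseudo-labeled real features plus synthetic features."""
    # Pseudo-labels come from the current linear classifier (argmax, no filtering).
    pseudo = np.argmax(image_feats @ W.T, axis=1)
    syn_x, syn_y = sample_class_balanced(text_feats, n_per_class, sigma, rng)
    x = np.concatenate([image_feats, syn_x])
    y = np.concatenate([pseudo, syn_y])
    probs = softmax(scale * x @ W.T)
    onehot = np.eye(W.shape[0])[y]
    # Gradient of the mean cross-entropy with respect to W.
    grad = scale * (probs - onehot).T @ x / len(x)
    return W - lr * grad, pseudo

rng = np.random.default_rng(0)
num_classes, dim = 10, 64
text_feats = l2_normalize(rng.normal(size=(num_classes, dim)))   # stand-in text features
image_feats = l2_normalize(rng.normal(size=(200, dim)))          # stand-in image features

# Initialize the linear classifier with the class text features, so the first
# prediction coincides with zero-shot classification by cosine similarity.
W = text_feats.copy()
for _ in range(5):
    W, pseudo = self_training_step(W, image_feats, text_feats, rng)
```

Only `W`, a matrix with one row per class, is trained and would need to be exchanged with the server in this sketch, which is what keeps computation and communication light. The synthetic samples give every class the same number of labeled points even when a client's own data covers only a few classes.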

Paper Structure (15 sections, 8 equations, 3 figures, 4 tables, 1 algorithm)

Figures (3)

  • Figure 1: Framework of the proposed FST-CBDG method for lightweight unsupervised federated learning. In the server preparation stage, the CLIP image encoder and the categorical text features extracted using the CLIP text encoder are distributed to each client. During local training, extracted image features from the fixed CLIP image encoder are used for self-training of the linear classifier. Synthetic instances are generated in the feature space via class-balanced Gaussian sampling to address the data heterogeneity problem.
  • Figure 2: Entropy distribution of predicted probability vectors. Green dots represent the entropy of each sample, and the red line denotes the upper bound of the entropy ($\log_2 10 \approx 3.322$).
  • Figure 3: Curves of the testing accuracy (%) w.r.t. communication rounds for the proposed method FST-CBDG and the two comparison methods, FedAvg and FedNTD, under homogeneous (i.i.d.) and heterogeneous (sharding) data distributions.
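
The entropy bound quoted in the Figure 2 caption can be checked numerically. The value 3.322 matches the base-2 logarithm of 10, so the short sketch below assumes a 10-class problem and entropy measured in bits. The helper name `entropy_bits` and the clipping constant `eps` are illustrative choices, not taken from the paper.

```python
import numpy as np

def entropy_bits(probs, eps=1e-12):
    """Shannon entropy in bits of probability vectors along the last axis."""
    p = np.clip(probs, eps, 1.0)  # avoid log2(0) for zero-probability classes
    return -(p * np.log2(p)).sum(axis=-1)

uniform = np.full(10, 0.1)   # least confident prediction over 10 classes
one_hot = np.eye(10)[0]      # fully confident prediction

print(entropy_bits(uniform))  # close to log2(10), about 3.322
print(entropy_bits(one_hot))  # close to 0
```

A uniform prediction attains the bound and a one-hot prediction sits near zero, so samples whose entropy approaches the red line are the ones the classifier is least sure about.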