Recovering Global Data Distribution Locally in Federated Learning

Ziyu Yao

TL;DR

The paper tackles label distribution skew in Federated Learning by proposing ReGL, a framework that recovers the global data distribution locally on each client. It leverages foundation generative models to synthesize images for minority and missing classes in a training-free manner, and further improves alignment with local data through adaptive LoRA-based fine-tuning that incorporates multimodal conditioning. By combining real and synthetic data, ReGL enables FedAvg-style aggregation to achieve near-centralized performance in global generalization and superior personalization, outperforming state-of-the-art baselines across multiple datasets and skew settings. The results show robust improvement under extreme skew, missing classes, and large numbers of clients, offering a privacy-preserving, scalable solution for FL with label distribution skew.

Abstract

Federated Learning (FL) is a distributed machine learning paradigm that enables multiple clients to collaboratively train a shared model without sharing raw data. However, a major challenge in FL is label imbalance, where a client may hold abundant data for only a few classes while many other classes are in the minority or missing altogether. Previous work focuses on optimizing local updates or global aggregation but ignores the underlying imbalance in label distribution across clients. In this paper, we propose ReGL, a novel approach to this challenge whose key idea is to Recover the Global data distribution Locally. Specifically, each client uses generative models to synthesize images that complement the minority and missing classes, thereby alleviating label imbalance. Moreover, we adaptively fine-tune the image generation process using local real data, which makes the synthetic images align more closely with the global distribution. Importantly, both the generation and fine-tuning processes are conducted on the client side without compromising data privacy. Through comprehensive experiments on various image classification datasets, we demonstrate that our approach substantially outperforms existing state-of-the-art methods in tackling label imbalance in FL.
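
To make the idea of complementing minority and missing classes concrete, the following is a minimal sketch of how a client might decide how many synthetic images to request per class. The function name `synthetic_quota` and the balancing rule (top up every class to the size of the client's largest real class) are assumptions made for illustration; the abstract does not specify the rule ReGL actually uses.

```python
from collections import Counter


def synthetic_quota(local_labels, num_classes, target_per_class=None):
    """Return a dict mapping class id -> number of synthetic images to add.

    Assumption (not from the paper): every class is topped up to a uniform
    target, which defaults to the size of the client's largest real class.
    """
    counts = Counter(local_labels)
    if target_per_class is None:
        target_per_class = max(counts.values(), default=0)
    # Missing classes have a real count of 0, so they receive the full target.
    return {
        c: max(0, target_per_class - counts.get(c, 0))
        for c in range(num_classes)
    }


# Hypothetical client: class 0 is a majority class, class 1 is a minority
# class, and classes 2 and 3 are missing entirely.
labels = [0] * 50 + [1] * 5
quota = synthetic_quota(labels, num_classes=4)
```

Under this assumed rule, the majority class receives no synthetic images, the minority class is topped up, and each missing class is filled entirely with synthetic data. The local model would then train on the union of real and synthetic samples.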

Paper Structure (35 sections, 3 equations, 8 figures, 7 tables, 1 algorithm)

Figures (8)

  • Figure 1: (a) Imbalanced label distribution for one client. (b) Recovered data distribution on the client side. The blue and orange histograms represent the numbers of real and synthetic images, respectively. (c) Comparison between our method and FedAvg under both IID and non-IID settings.
  • Figure 2: Our proposed ReGL framework. We use generative models to generate data on the client side, thereby alleviating label imbalance. To better recover the global distribution, clients fine-tune their generative models using local data. Both real and synthetic data are used to update the local models, resulting in a more balanced globally aggregated model.
  • Figure 3: The data distributions of clients after data partitioning, where 0.01 and 0.5 are the $\beta$ values. The color bar shows the number of samples, and each rectangle represents the number of samples of a particular class on a client. Here we take the 10-category ImageFruit and the 100-category ImageNet100 as examples.
  • Figure 4: t-SNE visualization of owned and missing classes. (a) FedAvg mixes all samples indiscriminately, while (b) our method effectively distinguishes them.
  • Figure 5: Performance with varying synthetic data volumes.
  • ...and 3 more figures
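
The Figure 3 caption refers to $\beta$ values of 0.01 and 0.5 for the data partition. The sketch below assumes, as is common in FL benchmarks but not confirmed by this summary, that $\beta$ is the concentration parameter of a per-class Dirichlet split across clients. The function name `dirichlet_partition` and all parameters are hypothetical, and only the Python standard library is used.

```python
import random


def dirichlet_partition(labels, num_clients, beta, seed=0):
    """Split sample indices across clients with Dirichlet label skew.

    Assumption: for each class, client shares are drawn from a symmetric
    Dirichlet(beta); a smaller beta gives a more skewed partition.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    clients = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # A Dirichlet sample is a vector of Gamma(beta, 1) draws, normalized.
        draws = [rng.gammavariate(beta, 1.0) for _ in range(num_clients)]
        total = sum(draws)
        if total == 0.0:
            # Guard against numerical underflow when beta is very small.
            draws = [0.0] * num_clients
            draws[rng.randrange(num_clients)] = 1.0
            total = 1.0
        cum, start = 0.0, 0
        for k in range(num_clients):
            cum += draws[k] / total
            # The last client takes the remainder so no index is dropped.
            if k == num_clients - 1:
                end = len(idxs)
            else:
                end = min(len(idxs), max(start, round(cum * len(idxs))))
            clients[k].extend(idxs[start:end])
            start = end
    return clients


# Hypothetical 10-class dataset with 100 samples per class, split 5 ways.
labels = [i % 10 for i in range(1000)]
parts = dirichlet_partition(labels, num_clients=5, beta=0.5)
```

With a small value such as 0.01, most of each class tends to land on a single client, which corresponds to the extreme-skew setting with many missing classes. A value of 0.5 spreads each class more evenly while still leaving the clients imbalanced.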