Stratify: Rethinking Federated Learning for Non-IID Data through Balanced Sampling

Hui Yeok Wong, Chee Kau Lim, Chee Seng Chan

TL;DR

This work tackles non-IID data in Federated Learning by reframing training around a Stratified Label Schedule (SLS) that ensures balanced label exposure, complemented by label-aware client selection and high-frequency, fine-grained updates. It provides a theoretical basis for an unbiased gradient estimator with variance reduction and introduces a CKKS-based privacy protocol to compute global label statistics without exposing client data. A custom Batch Normalization strategy stabilizes learning under non-IID conditions, extending applicability to BN-enabled models. Empirically, Stratify achieves near-IID accuracy, faster convergence, and lower per-client computation across a range of datasets, demonstrating practical viability in heterogeneous FL environments.

Abstract

Federated Learning (FL) on non-independently and identically distributed (non-IID) data remains a critical challenge, as existing approaches struggle with severe data heterogeneity. Current methods primarily address symptoms of non-IID by applying incremental adjustments to Federated Averaging (FedAvg), rather than directly resolving its inherent design limitations. Consequently, performance significantly deteriorates under highly heterogeneous conditions, as the fundamental issue of imbalanced exposure to diverse class and feature distributions remains unresolved. This paper introduces Stratify, a novel FL framework designed to systematically manage class and feature distributions throughout training, effectively tackling the root cause of non-IID challenges. Inspired by classical stratified sampling, our approach employs a Stratified Label Schedule (SLS) to ensure balanced exposure across labels, significantly reducing bias and variance in aggregated gradients. Complementing SLS, we propose a label-aware client selection strategy, restricting participation exclusively to clients possessing data relevant to scheduled labels. Additionally, Stratify incorporates a fine-grained, high-frequency update scheme, accelerating convergence and further mitigating data heterogeneity. To uphold privacy, we implement a secure client selection protocol leveraging homomorphic encryption, enabling precise global label statistics without disclosing sensitive client information. Extensive evaluations on MNIST, CIFAR-10, CIFAR-100, Tiny-ImageNet, COVTYPE, PACS, and Digits-DG demonstrate that Stratify attains performance comparable to IID baselines, accelerates convergence, and reduces client-side computation compared to state-of-the-art methods, underscoring its practical effectiveness in realistic federated learning scenarios.
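
The two ideas named in the abstract, a label schedule with balanced exposure and label-aware client selection, can be illustrated with a small sketch. This is not the paper's algorithm: the round-robin interleaving rule, the function names `build_label_schedule` and `select_clients`, and the toy label counts are all assumptions made for illustration, and the sketch works on plain (unencrypted) label statistics rather than the homomorphic-encryption protocol the paper describes.

```python
from collections import Counter


def build_label_schedule(global_counts):
    """Interleave labels round-robin until every label's count is used up.

    Assumption: balanced exposure is approximated by cycling through the
    labels in sorted order; the paper's actual SLS construction may differ.
    """
    remaining = dict(global_counts)
    schedule = []
    while any(n > 0 for n in remaining.values()):
        for label in sorted(remaining):
            if remaining[label] > 0:
                schedule.append(label)
                remaining[label] -= 1
    return schedule


def select_clients(batch_labels, client_labels):
    """Return the ids of clients holding at least one scheduled label."""
    wanted = set(batch_labels)
    return sorted(
        cid for cid, labels in client_labels.items() if wanted & set(labels)
    )


# Hypothetical global label statistics: label -> number of samples.
global_counts = {0: 3, 1: 3, 2: 2}
schedule = build_label_schedule(global_counts)

# Hypothetical clients and the labels each one holds.
client_labels = {"A": [0], "B": [1, 2], "C": [2]}
selected = select_clients(schedule[:2], client_labels)

print(schedule)  # [0, 1, 2, 0, 1, 2, 0, 1]
print(selected)  # ['A', 'B']
```

In this toy run the first two scheduled labels are 0 and 1, so client C, which holds only label 2, is left out of that batch.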

Paper Structure

This paper contains 27 sections, 10 equations, 4 figures, 6 tables, 4 algorithms.

Figures (4)

  • Figure 1: Overview of the Stratify training process. (a) Single-sample learning: (a1) the server sends the masked labels to train and the next client's address to the current client (the first client also receives the initialized global model parameters); (a2) the client converts the masked labels to real labels; (a3) the client updates the model sequentially on the required labels; (a4) the client signals the server to send the next masked labels to the next client; (a5) the client sends the updated model to the next client. Steps a1 to a5 repeat until all masked labels have been pulled from $SLS$; (a6) the last client sends the final updated model to the server once it completes its training; and (a7) the server broadcasts the received global model to all clients. (b) Batch-data learning: (b1) the server sends the masked labels to train to all clients selected for the current batch; (b2) each client converts the masked labels to real labels; (b3) each client computes its summed gradient; (b4) each client returns its summed gradient; (b5) the server computes a unified gradient by summing the clients' summed gradients and dividing by the batch size, updates the global model, and sends the updated model to all clients selected for the next batch. Steps b1 to b5 repeat until all masked labels have been pulled from $SLS$.
  • Figure 2: Comparison of algorithms' performance with increasing client numbers on different datasets and data partitions in single-sample learning
  • Figure 3: Comparison of algorithms' performance with increasing client numbers on different datasets and data partitions in batch-data learning
  • Figure 4: FedTAN and Stratify performance on CIFAR-10 across various non-IID settings
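
Steps b3 to b5 of Figure 1 (clients return summed gradients, the server sums them and divides by the batch size) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: gradients are plain Python lists, the helper names `aggregate_summed_gradients` and `sgd_step` are hypothetical, and the plain SGD update with learning rate `lr` is an assumed choice of optimizer.

```python
def aggregate_summed_gradients(client_sums, batch_size):
    """Sum the clients' summed gradients, then divide by the batch size."""
    dim = len(client_sums[0])
    total = [0.0] * dim
    for grad in client_sums:
        for i, value in enumerate(grad):
            total[i] += value
    return [t / batch_size for t in total]


def sgd_step(params, grad, lr):
    """Plain SGD update (assumed optimizer, for illustration only)."""
    return [p - lr * g for p, g in zip(params, grad)]


# Hypothetical per-sample gradients held by two selected clients.
client_a = [[1.0, 2.0], [3.0, 4.0]]  # two scheduled samples
client_b = [[5.0, 6.0]]              # one scheduled sample

# Step b3: each client sums its own per-sample gradients.
sum_a = [sum(col) for col in zip(*client_a)]
sum_b = [sum(col) for col in zip(*client_b)]

# Step b5: the server forms the unified gradient over the whole batch.
batch_size = len(client_a) + len(client_b)
unified = aggregate_summed_gradients([sum_a, sum_b], batch_size)
new_params = sgd_step([0.0, 0.0], unified, lr=0.1)

print(unified)  # [3.0, 4.0]
```

Because each client sends a sum rather than a local average, dividing once by the global batch size gives the same result as averaging all per-sample gradients in one place, regardless of how the samples are split across clients.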