Gradient Coreset for Federated Learning

Durga Sivasubramanian; Lokesh Nagalapatti; Rishabh Iyer; Ganesh Ramakrishnan

TL;DR

This work addresses robust, energy-efficient Federated Learning (FL) when client data are non-i.i.d. and noisy by introducing GCFL, a gradient-based coreset selection framework. Each client constructs a coreset of size $b$, guided by gradients computed on a small server validation set $D_S$; it reselects the coreset every $K$ rounds and computes its updates only from it. Selection is posed as a gradient-matching objective and solved greedily with Orthogonal Matching Pursuit. GCFL improves robustness to both attribute and label noise, balances class distributions through per-class (label-wise) selection, keeps the added communication small by broadcasting only the validation gradients needed for selection, and preserves the privacy of the validation set because the server shares gradients rather than data. Experiments on CIFAR-10/100, Flowers, and FEMNIST show that GCFL outperforms baselines in noisy settings while adding only modest overhead. The gains are largest when client data are noisy and heterogeneously distributed.

Abstract

Federated Learning (FL) is used to learn machine learning models with data that is partitioned across multiple clients, including resource-constrained edge devices. It is therefore important to devise solutions that are efficient in terms of compute, communication, and energy consumption, while ensuring compliance with the FL framework's privacy requirements. Conventional approaches to these problems select a weighted subset of the training dataset, known as a coreset, and learn by fitting models on it. Such coreset selection approaches are also known to be robust to data noise. However, these approaches rely on the overall statistics of the training data and are not easily extendable to the FL setup. In this paper, we propose an algorithm called Gradient based Coreset for Robust and Efficient Federated Learning (GCFL) that selects a coreset at each client only every $K$ communication rounds and derives updates only from it, assuming the availability of a small validation dataset at the server. We demonstrate that our coreset selection technique is highly effective in accounting for noise in clients' data. We conduct experiments using four real-world datasets and show that GCFL (1) is more compute- and energy-efficient than FL, (2) is robust to various kinds of noise in both the feature space and the labels, (3) preserves the privacy of the validation dataset, and (4) introduces a small communication overhead but achieves significant gains in performance, particularly when the clients' data is noisy.
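
The gradient-matching idea behind the coreset selection can be illustrated with a minimal sketch. It assumes plain Orthogonal Matching Pursuit with unconstrained least-squares weights; the function name `omp_coreset`, its arguments, and the stopping tolerance are hypothetical, and the exact objective used by GCFL (for example any regularization or constraint on the weights) is not specified in this summary.

```python
import numpy as np

def omp_coreset(client_grads, target_grad, budget):
    """Greedily pick up to `budget` rows of `client_grads` whose weighted
    sum approximates `target_grad` (plain OMP, an illustrative assumption).

    client_grads: (n, d) array, one gradient per client sample.
    target_grad:  (d,) array, e.g. a gradient computed on validation data.
    Returns (selected indices, least-squares weights).
    """
    n, _ = client_grads.shape
    target = np.asarray(target_grad, dtype=float)
    selected = []
    weights = np.zeros(0)
    residual = target.copy()
    for _ in range(min(budget, n)):
        # Score every sample by its correlation with the current residual.
        scores = np.abs(client_grads @ residual)
        if selected:
            scores[selected] = -np.inf  # never pick the same sample twice
        selected.append(int(np.argmax(scores)))
        # Refit the weights of all selected samples by least squares.
        basis = client_grads[selected].T  # shape (d, k)
        weights, *_ = np.linalg.lstsq(basis, target, rcond=None)
        residual = target - basis @ weights
        if np.linalg.norm(residual) < 1e-10:
            break  # target is matched exactly; no need to add more samples
    return selected, weights

# Toy usage: the target is an exact combination of samples 0 and 3.
grads = np.eye(5)
target = 2.0 * grads[0] + 3.0 * grads[3]
indices, w = omp_coreset(grads, target, budget=2)
print(sorted(indices))
```

In this toy case the two selected samples reproduce the target gradient exactly. With real per-sample gradients the match is approximate, and the budget plays the role of the coreset size $b$.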

Paper Structure (29 sections, 7 equations, 13 figures, 3 tables, 2 algorithms)

Introduction
Motivating experiment
Related Work
Problem Setup
The GCFL Solution Approach
Greedy solution to select Coreset (gradient-matching objective)
Label-wise Coreset Selection
Broadcasting Label-wise gradients
Experiments
Datasets
Baselines
Model Architecture and Experimental Setup
Robustness
Does GCFL select clean data points?
Efficiency
...and 14 more sections

Figures (13)

Figure 1: Schematic overview of GCFL. We illustrate a server with a limited validation dataset and multiple participating clients, which are edge devices whose data contain noise.
Figure 2: Performance of FedAvg, GCFL, and skyline under 40% label noise. Skyline is trained only on the clean points. GCFL performs comparably to the skyline.
Figure 3: Workflow of GCFL for binary classification with blue and green classes. The server transmits the final-layer gradients computed on the validation dataset $D_S$. The client employs the OMP algorithm to select a coreset $(\mathcal{X}_i^t, w_i^t)$, which is used to compute the updates shared with the server.
Figure 4: Performance comparison of GCFL and baselines with varying closed-set noise percentages. The X-axis indicates the introduced noise level, and the Y-axis shows test set accuracy. At x=0, no noise is present. Overall, GCFL outperforms the baselines, except on the Flowers dataset, where subset selection hurts.
Figure 5: Performance of GCFL in the presence of open-set noise with a 10% data subset. The legend is the same as in Figure 4.
...and 8 more figures
