Local Data Quantity-Aware Weighted Averaging for Federated Learning with Dishonest Clients

Leming Wu, Yaochu Jin, Kuangrong Hao, Han Yu

TL;DR

This work addresses a vulnerability of server-side weighted aggregation in federated learning (FL): clients can report dishonest data volumes. It introduces FedDua, which adds a client-side quantity-aware branch that predicts an adjustment factor $\alpha$ from the local update $\Delta\theta$, the learning rate $\eta$, and the expected gradient $\mathbb{E}[\nabla L(\theta_i)]$. The server verifies reported data volumes by comparing $\alpha$ against a pre-trained distribution, flags or excludes dishonest clients, and, when necessary, aggregates using the predicted data contributions instead of the reported ones. The approach is summarized by the relations $\alpha = f_{dua}(\varphi; \mathrm{embedding}(\text{client}_i))$ and $\text{Loss}_{dua} = \frac{1}{2}\left\| \frac{\Delta \theta_i}{\eta \mathbb{E}[\nabla L(\theta_i)] \alpha} - |D_i| \right\|^2$, where the ratio $R \approx \frac{\Delta \theta_i}{\eta \mathbb{E}[\nabla L(\theta_i)] \alpha}$ links updates to data volume. Empirical results on CIFAR-10 and MedMNIST show that FedDua yields an average improvement of $3.17\%$ over four popular FL aggregators in the presence of inaccurate data declarations, while incurring only modest client-side overhead and no extra communication. The method is modular and can be integrated into existing FL algorithms to improve robustness against data-volume manipulation; future work extends it to data quality considerations.
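The relation $R \approx \frac{\Delta \theta_i}{\eta \mathbb{E}[\nabla L(\theta_i)] \alpha}$ can be illustrated with a small sketch. The summary does not say how the vector-valued ratio is reduced to a scalar, so the code below assumes L2 norms for that step. The function names, the relative tolerance `rel_tol`, and the direct comparison of the estimate with the reported volume are all hypothetical simplifications. FedDua itself is described as comparing $\alpha$ against a pre-trained distribution, which this sketch does not implement.

```python
import math
from typing import Sequence


def l2_norm(vec: Sequence[float]) -> float:
    """Euclidean norm of a flat parameter or gradient vector."""
    return math.sqrt(sum(x * x for x in vec))


def estimate_volume(delta_theta: Sequence[float],
                    mean_grad: Sequence[float],
                    lr: float,
                    alpha: float) -> float:
    """Estimate R ~ ||delta_theta|| / (lr * ||mean_grad|| * alpha).

    ASSUMPTION: norms are used to turn the vector ratio into a scalar.
    """
    denom = lr * l2_norm(mean_grad) * alpha
    if denom <= 0:
        raise ValueError("lr, alpha and the gradient norm must be positive")
    return l2_norm(delta_theta) / denom


def is_suspicious(reported: float, estimated: float, rel_tol: float = 0.2) -> bool:
    """Flag a report that deviates from the estimate by more than rel_tol.

    ASSUMPTION: a fixed relative tolerance stands in for the server-side
    check against a pre-trained distribution of alpha.
    """
    return abs(reported - estimated) > rel_tol * estimated


# Toy numbers chosen so that the estimate is about 200 samples.
estimated = estimate_volume(delta_theta=[0.06, 0.08],
                            mean_grad=[0.3, 0.4],
                            lr=0.1,
                            alpha=0.01)
honest_flag = is_suspicious(reported=200, estimated=estimated)
inflated_flag = is_suspicious(reported=1000, estimated=estimated)
```

With these toy values, a report of 200 samples is consistent with the estimate, while a report of 1000 samples is flagged.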

Abstract

Federated learning (FL) enables collaborative training of deep learning models without requiring data to leave local clients, thereby preserving client privacy. The aggregation process on the server plays a critical role in the performance of the resulting FL model. The most commonly used aggregation method is weighted averaging based on the amount of data held by each client, which is assumed to reflect each client's contribution. However, this method is prone to model bias, because dishonest clients might report inaccurate training data volumes to the server, and such reports are hard to verify. To address this issue, we propose a novel secure Federated Data quantity-aware weighted averaging method (FedDua). It enables FL servers to accurately predict the amount of training data held by each client from the local model gradients that the client uploads. Furthermore, it can be seamlessly integrated into any FL algorithm that involves server-side model aggregation. Extensive experiments on three benchmark datasets demonstrate that FedDua improves global model performance by an average of 3.17% compared to four popular FL aggregation methods in the presence of inaccurate client data volume declarations.
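To make the bias concrete, the following minimal sketch applies data-volume-weighted averaging (in the style of FedAvg) to three toy client updates. It shows how a single inflated volume report can shift the aggregate. The function name `weighted_average` and all numbers are illustrative assumptions, not values from the paper.

```python
from typing import List, Sequence


def weighted_average(updates: Sequence[Sequence[float]],
                     volumes: Sequence[float]) -> List[float]:
    """Average client updates, weighting each by its reported data volume."""
    total = float(sum(volumes))
    if total <= 0:
        raise ValueError("total reported data volume must be positive")
    dim = len(updates[0])
    aggregate = [0.0] * dim
    for update, volume in zip(updates, volumes):
        weight = volume / total
        for j in range(dim):
            aggregate[j] += weight * update[j]
    return aggregate


# Two clients agree on the update direction; the third disagrees.
client_updates = [[1.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]

# Honest reports: every client declares 100 samples.
honest = weighted_average(client_updates, [100, 100, 100])

# Dishonest report: the third client declares 1000 samples instead of 100.
inflated = weighted_average(client_updates, [100, 100, 1000])
```

In the honest case the aggregate follows the majority direction (each coordinate is 1/3). With the inflated report, the third client's weight rises to 1000/1200 and each coordinate of the aggregate becomes -2/3, flipping its sign.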

Paper Structure

This paper contains 11 sections, 12 equations, 4 figures, 1 algorithm.

Figures (4)

  • Figure 1: This experiment evaluates the global model accuracy of the FedAvg algorithm on the CIFAR-10 dataset, comparing scenarios with and without dishonest clients manipulating data volume. Two levels of Non-IID data distribution, defined by Dirichlet parameters $\beta = 0.1$ and $\beta = 0.5$, are considered.
  • Figure 2: The proposed FedDua approach architecture.
  • Figure 3: Distribution of the $\alpha$ values predicted by the client-side data quantity-aware branch across communication rounds, for clients holding different amounts of data.
  • Figure 4: The proposed algorithm is tested with four FL methods: FedAvg, FedProx, Scaffold, and Ditto. Two scenarios are considered: (1) all clients behave honestly, and (2) one client uploads falsified data volumes. The evaluation measures test accuracy on the CIFAR-10, MedMNIST-PathMNIST, and MedMNIST-OrganAMNIST datasets. The data are partitioned non-IID across clients using a Dirichlet distribution with $\beta = 0.5$.
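
The captions above describe non-IID splits controlled by a Dirichlet parameter $\beta$. A common way to build such a split, sketched below, is to draw per-class client proportions from a symmetric Dirichlet distribution and divide each class's sample indices accordingly. This page does not confirm that the paper uses this exact procedure. The function names, the seed, and the toy label list are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def dirichlet_sample(beta: float, k: int, rng: random.Random) -> List[float]:
    """Draw one sample from a symmetric Dirichlet(beta) via normalized gammas."""
    draws = [rng.gammavariate(beta, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:  # extremely unlikely numerical underflow
        return [1.0 / k] * k
    return [d / total for d in draws]


def partition_by_dirichlet(labels: Sequence[int],
                           num_clients: int,
                           beta: float,
                           seed: int = 0) -> List[List[int]]:
    """Assign sample indices to clients with label skew controlled by beta."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    clients: List[List[int]] = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        props = dirichlet_sample(beta, num_clients, rng)
        start, cumulative = 0, 0.0
        for c in range(num_clients):
            cumulative += props[c]
            # The last client takes the remainder so no index is dropped.
            if c == num_clients - 1:
                end = len(idxs)
            else:
                end = min(len(idxs), max(start, int(round(cumulative * len(idxs)))))
            clients[c].extend(idxs[start:end])
            start = end
    return clients


# Toy dataset: 1000 samples over 10 balanced classes, split across 5 clients.
toy_labels = [i % 10 for i in range(1000)]
parts = partition_by_dirichlet(toy_labels, num_clients=5, beta=0.5)
sizes = [len(p) for p in parts]
```

Smaller values of `beta` (for example 0.1) concentrate each class on fewer clients, giving a more strongly skewed split than `beta = 0.5`.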