Beyond performance-wise Contribution Evaluation in Federated Learning

Balazs Pejo

TL;DR

The results reveal that no single client excels across all dimensions, which are largely independent of each other. This highlights a critical flaw in current evaluation schemes: no single metric is adequate for comprehensive evaluation and equitable reward allocation.

Abstract

Federated learning offers a privacy-friendly collaborative learning framework, yet its success, like that of any joint venture, hinges on the contributions of its participants. Existing client evaluation methods predominantly focus on model performance, such as accuracy or loss, which represents only one dimension of a machine learning model's overall utility. In contrast, this work investigates the critical, yet overlooked, issue of client contributions towards a model's trustworthiness, specifically its reliability (tolerance to noisy data), resilience (resistance to adversarial examples), and fairness (measured via demographic parity). To quantify these multifaceted contributions, we employ a state-of-the-art approximation of the Shapley value, a principled method for value attribution. Our results reveal that no single client excels across all dimensions, which are largely independent of each other, highlighting a critical flaw in current evaluation schemes: no single metric is adequate for comprehensive evaluation and equitable reward allocation.
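
To make the attribution idea concrete, the following is a minimal sketch of the exact Shapley value for two hypothetical clients, computed separately for two score dimensions. It is not the approximation used in the paper. The client names, the utility tables, and the function name shapley_values are all assumptions made for illustration, and the numbers are invented rather than taken from the results.

```python
from itertools import combinations
from math import factorial


def shapley_values(clients, utility):
    """Exact Shapley value of each client for a coalition utility function.

    `utility` maps a frozenset of clients to the score of the model trained
    by that coalition. Exact enumeration is exponential in the number of
    clients, so this is only practical for very small federations.
    """
    n = len(clients)
    values = {c: 0.0 for c in clients}
    for c in clients:
        others = [o for o in clients if o != c]
        for k in range(n):
            # Weight of coalitions of size k that do not contain c.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                values[c] += weight * (utility(s | {c}) - utility(s))
    return values


clients = ["A", "B"]

# Hypothetical coalition scores (made-up numbers, higher is better).
accuracy = {frozenset(): 0.5, frozenset("A"): 0.7,
            frozenset("B"): 0.6, frozenset("AB"): 0.8}
fairness = {frozenset(): 0.5, frozenset("A"): 0.5,
            frozenset("B"): 0.8, frozenset("AB"): 0.7}

acc_contrib = shapley_values(clients, accuracy.__getitem__)
fair_contrib = shapley_values(clients, fairness.__getitem__)

print({c: round(v, 3) for c, v in acc_contrib.items()})
print({c: round(v, 3) for c, v in fair_contrib.items()})
```

In this toy setting client A contributes more to accuracy while client B contributes more to fairness, which is the kind of disagreement between dimensions that the abstract describes.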

Paper Structure (29 sections, 7 equations, 12 figures, 5 tables, 1 algorithm)

Introduction
Contributions.
Organization.
Related Works
Trustworthy AI
Fairness.
Robustness.
Contribution Evaluation
Fairness.
Robustness.
Trustworthiness Scores
Notation.
Contribution Evaluation.
Performance Scoring.
Fairness Scoring.
...and 14 more sections

Figures (12)

Figure 1: Example illustrating what the different trustworthiness scores capture, based on $D^\prime=\{(1,G),(2,R),(3,G),(4,R),(5,R),(6,G)\}$. While the model is $0.67$ accurate, its trustworthiness scores differ widely, e.g., $\mathtt{fair}(M)=0.33$ when Demographic Parity is used where the square numbers (1,2,4) are protected and 'Red' is the target class. Regarding reliability, the model might misclassify three samples (1,3,4) with various probabilities, resulting in an overall chance of $\mathtt{rel}(M)=0.09$. Regarding resilience, the attacker is successful once (1) out of the four correctly classified samples (1,2,5,6), i.e., $\mathtt{res}(M)=0.25$. Note that in the paper we transform these trustworthiness metrics (i.e., $m\rightarrow(1-m)$) to be similar to accuracy: higher values mean a better model.
Figure 2: Score evolution of the model as the training progresses for the non-IID scenario.
Figure 3: Computed scores for the 4 non-IID clients.
Figure 4: Score distributions of the 20 non-IID clients.
Figure 5: Heatmap of the pair-wise score correlations for the 4 non-IID client setting.
...and 7 more figures

Theorems & Definitions (1)

Definition 1: Demographic Parity [calders2009building]
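
As a worked check of the Figure 1 numbers, the sketch below recomputes the accuracy and the Demographic Parity gap for $D^\prime$. Two things are assumed here rather than stated on this page: the task has only the two classes G and R, so each misclassified sample receives the other label, and the gap is the absolute difference between the target-class prediction rates of the protected and unprotected groups. The variable names are illustrative.

```python
# Samples from Figure 1: id -> true label ("G" = Green, "R" = Red).
true_labels = {1: "G", 2: "R", 3: "G", 4: "R", 5: "R", 6: "G"}

# Assumption: with two classes, a misclassified sample gets the other label.
# Figure 1 lists samples 1, 2, 5, 6 as correctly classified, so 3 and 4 flip.
flip = {"G": "R", "R": "G"}
misclassified = {3, 4}
predictions = {
    i: flip[y] if i in misclassified else y for i, y in true_labels.items()
}

protected = {1, 2, 4}                      # protected group from Figure 1
unprotected = set(true_labels) - protected
target = "R"                               # 'Red' is the target class


def target_rate(ids):
    """Fraction of the given samples that the model predicts as the target."""
    ids = list(ids)
    return sum(predictions[i] == target for i in ids) / len(ids)


accuracy = sum(predictions[i] == y for i, y in true_labels.items()) / len(true_labels)

# Assumed form of the gap: |P(pred = R | protected) - P(pred = R | unprotected)|.
dp_gap = abs(target_rate(protected) - target_rate(unprotected))

print(round(accuracy, 2), round(dp_gap, 2))  # 0.67 0.33
```

Under these assumptions the protected group is predicted 'Red' once out of three samples and the unprotected group twice out of three, which reproduces $\mathtt{fair}(M)=0.33$ alongside the $0.67$ accuracy.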
