LIA: Privacy-Preserving Data Quality Evaluation in Federated Learning Using a Lazy Influence Approximation

Ljubomir Rokvic, Panayiotis Danassis, Sai Praneeth Karimireddy, Boi Faltings

TL;DR

This work proposes a simple yet effective approach that utilizes a new influence approximation called "lazy influence" to filter and score data while preserving privacy, and has been shown to successfully filter out biased and corrupted data in various simulated and real-world settings.

Abstract

In Federated Learning, it is crucial to handle low-quality, corrupted, or malicious data. However, traditional data valuation methods are not suitable due to privacy concerns. To address this, we propose a simple yet effective approach that utilizes a new influence approximation called "lazy influence" to filter and score data while preserving privacy. To do this, each participant uses their own data to estimate the influence of another participant's batch and sends a differentially private obfuscated score to the central coordinator. Our method has been shown to successfully filter out biased and corrupted data in various simulated and real-world settings, achieving a recall rate of over $90\%$ (sometimes up to $100\%$) while maintaining strong differential privacy guarantees with $\varepsilon \leq 1$.
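
The obfuscated score described in the abstract can be illustrated with a minimal sketch. It assumes the score is a binary vote and that the obfuscation is classic randomized response, which satisfies $\varepsilon$-differential privacy for a single bit. The function names and the loss-based voting rule are hypothetical, chosen for illustration, and are not taken from the paper.

```python
import math
import random


def truth_probability(epsilon: float) -> float:
    """Probability of reporting the true bit under randomized response."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def lazy_vote(loss_before: float, loss_after: float) -> int:
    """Hypothetical scoring rule: vote 1 if the contributor's partially
    updated model lowers the validator's loss on its own data, else 0."""
    return 1 if loss_after < loss_before else 0


def randomized_response(vote: int, epsilon: float, rng: random.Random) -> int:
    """Report the vote truthfully with probability e^eps / (1 + e^eps),
    otherwise report the flipped bit."""
    if vote not in (0, 1):
        raise ValueError("vote must be 0 or 1")
    return vote if rng.random() < truth_probability(epsilon) else 1 - vote


if __name__ == "__main__":
    rng = random.Random(0)
    vote = lazy_vote(loss_before=1.0, loss_after=0.5)
    print(randomized_response(vote, epsilon=1.0, rng=rng))
```

Under these assumptions, a validator with $\varepsilon = 1$ reports its true vote with probability of roughly 0.73, so a single report reveals little, while the sum over many validators still separates helpful from harmful contributors.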

Paper Structure (41 sections, 6 equations, 13 figures, 3 tables)

Figures (13)

  • Figure 1: Data filtering procedure. Data contributors $A_1$ and $A_2$ want to join the federation, but might have biased or corrupted data (e.g., watermarks on $A_1$'s X-rays). $A_1$ sends a differentially private (DP) partially updated model ($\tilde{\theta}_t^{A_1}$) to validators $B_i$, who submit a DP vote based on the performance they observe when using the model $\tilde{\theta}_t^{A_1}$ on their task (Lazy Influence Approximation). The aggregated votes are used as a 'rite of passage' to decide whether to incorporate $A_1$'s data. The same process happens for $A_2$. $A_2$ is accepted into the federation, while $A_1$ is filtered out. Filtering can significantly improve the model's accuracy from 61% (no filtering) to 73% (proposed), matching the performance of the optimal (oracle) filtering on diagnosing heart conditions from X-rays using real data from rajpurkar2017chexnet.
  • Figure 2: Filtering Poor Data Using the Lazy Influence Approximation (LIA) in FL
  • Figure 3: Visualization of the private voting scheme. The $x$-axis represents a contributor participant $A$. The $y$-axis shows the sum of all votes from all the validators, i.e., $\sum_{\forall B} \mathcal{I}_{LIA}(Z_{val}^B)$. Figure \ref{fig:sub1} corresponds to the sum of true votes (no privacy) for the validation data of each contributor on the $x$-axis, while Figure \ref{fig:sub2} depicts the sum of differentially private votes ($\varepsilon=1$), according to the randomized reporting algorithm. Finally, Figure \ref{fig:sub3} shows the filtration threshold, corresponding to the arithmetic mean of the two cluster centers (computed using k-means).
  • Figure 4: Model performance (model accuracy) over 25 communication rounds. 30% mislabel rate on CIFAR-10. The proposed (LIA) and oracle filters are used only once at the start ('rite of passage' scenario). We compare a centralized model with no filtering (blue) to an FL model under perfect (oracle) filtering (orange), KRUM (red), Trimmed-mean (purple), Centered-Clipping (brown), our approach with FedAvg (green), and our approach with Centered-Clipping (pink). Note that the line for KRUM is jagged because only a single gradient is selected instead of performing FedAvg.
  • Figure 5: Recall and Precision on CIFAR-10, highly non-IID ($\alpha \rightarrow 0.1$), for increasing problem size (# of participants), and varying privacy guarantees (lower $\varepsilon$ provides stronger privacy). $\delta = 10^{-5}$.
  • ...and 8 more figures
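
The filtration threshold described in the Figure 3 caption (the arithmetic mean of two k-means cluster centers over the summed votes) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a plain one-dimensional Lloyd iteration with $k=2$, and it assumes that contributors whose vote sum lies above the threshold are the ones accepted. The function names and the example vote sums are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def two_means_1d(values: Sequence[float], iters: int = 100) -> Tuple[float, float]:
    """Lloyd's algorithm for k=2 on scalars, initialised at the min and max.
    Returns the (low, high) cluster centers."""
    lo, hi = float(min(values)), float(max(values))
    for _ in range(iters):
        # Assign each value to the nearer center (ties go to the low cluster).
        low_cluster = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high_cluster = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not high_cluster:  # all values identical: a single cluster
            break
        new_lo = sum(low_cluster) / len(low_cluster)
        new_hi = sum(high_cluster) / len(high_cluster)
        if new_lo == lo and new_hi == hi:  # converged
            break
        lo, hi = new_lo, new_hi
    return lo, hi


def filter_contributors(vote_sums: Dict[str, float]) -> Tuple[float, List[str]]:
    """Threshold = mean of the two cluster centers; accept contributors
    whose summed votes exceed it (acceptance direction is an assumption)."""
    lo, hi = two_means_1d(list(vote_sums.values()))
    threshold = (lo + hi) / 2.0
    accepted = sorted(name for name, total in vote_sums.items() if total > threshold)
    return threshold, accepted


if __name__ == "__main__":
    # Hypothetical summed votes from all validators for four contributors.
    print(filter_contributors({"A1": 1, "A2": 9, "A3": 8, "A4": 2}))
```

With the hypothetical vote sums above, the two centers settle at 1.5 and 8.5, giving a threshold of 5.0, so `A2` and `A3` are accepted and `A1` and `A4` are filtered out.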