Incentivising the federation: gradient-based metrics for data selection and valuation in private decentralised training

Dmitrii Usynin, Daniel Rueckert, Georgios Kaissis

TL;DR

Problem: obtaining high-quality data for collaborative training is hard because of privacy regulations and a lack of incentives for data owners. Approach: gradient-based data valuation with VoG and PLIS to identify useful samples in private decentralised training, with score normalisation and the option of publishing the scores under differential privacy. Contributions: formal definitions of VoG and PLIS; experiments on CIFAR-10, CINIC-10, and PPPD in both $(\varepsilon,\delta)$-DP and non-DP settings; comparisons against per-sample losses and $L_2$ gradient norms; and an analysis of the effect of model size and dataset, and of the relationship between VoG and PLIS. Impact: supports principled data selection and data-owner incentivisation in privacy-preserving FL, offering a practical, DP-compatible route to reward allocation and data marketplace formation.

Abstract

Obtaining high-quality data for collaborative training of machine learning models can be challenging due to A) regulatory concerns and B) a lack of incentives for data owners to participate. The first issue can be addressed by combining distributed machine learning techniques (e.g. federated learning) with privacy-enhancing technologies (PETs), such as differentially private (DP) model training. The second can be addressed by rewarding participants for giving access to data that benefits the model being trained, which is of particular importance in federated settings, where the data is unevenly distributed. However, DP noise can adversely affect underrepresented and atypical (yet often informative) data samples, making it difficult to assess their usefulness. In this work, we investigate how to leverage gradient information to let the participants of private training settings select the data most beneficial for the jointly trained model. We assess two such methods, namely variance of gradients (VoG) and the privacy loss-input susceptibility score (PLIS). We show that these techniques can provide federated clients with tools for principled data selection even in stricter privacy settings.
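
As a rough illustration of the variance-of-gradients idea named above, the sketch below scores each sample by how much its input gradient changes across training checkpoints. The paper's formal VoG definition is not reproduced in this section, so the details here are assumptions: the function name `vog_scores`, the `(K, N, D)` layout of the precomputed gradients, averaging the variance over input dimensions, and the per-class z-score normalisation are all illustrative choices rather than the authors' implementation.

```python
import numpy as np


def vog_scores(input_grads, labels=None):
    """Hypothetical VoG-style score from precomputed input gradients.

    input_grads: array of shape (K, N, D) holding, for K checkpoints and
                 N samples, the gradient with respect to each of the D
                 input dimensions (assumed to be computed elsewhere).
    labels:      optional array of shape (N,) with class labels; if given,
                 scores are z-normalised within each class (an assumption).
    """
    grads = np.asarray(input_grads, dtype=np.float64)
    # Variance over the checkpoint axis, per sample and input dimension.
    per_dim_var = grads.var(axis=0)          # shape (N, D)
    # Average over input dimensions to get one raw score per sample.
    raw = per_dim_var.mean(axis=1)           # shape (N,)
    if labels is None:
        return raw

    labels = np.asarray(labels)
    normalised = np.empty_like(raw)
    for c in np.unique(labels):
        mask = labels == c
        std = raw[mask].std()
        # Guard against a zero spread within a class.
        normalised[mask] = (raw[mask] - raw[mask].mean()) / (std if std > 0 else 1.0)
    return normalised


# Toy input: 3 checkpoints, 2 samples, 2 input dimensions.
toy_grads = np.array([
    [[1.0, 1.0], [0.0, 0.0]],
    [[1.0, 1.0], [2.0, 2.0]],
    [[1.0, 1.0], [4.0, 4.0]],
])
raw_scores = vog_scores(toy_grads)
class_scores = vog_scores(toy_grads, labels=[0, 0])
```

In the toy input, the first sample's gradient is identical at every checkpoint and so receives a raw score of zero, while the second sample's gradient keeps changing and receives a higher score. Under this reading, higher scores flag samples the model finds harder or more atypical.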

Paper Structure

This paper contains 18 sections, 4 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: Distribution of normalised VoG scores for ResNet-18 (PPPD, $\varepsilon = 4$). Higher values indicate atypical samples.
  • Figure 2: Comparison of images with largest VoGs for WideResNet-50 ($\varepsilon = 4$) and ResNet-101 ($\varepsilon = 4$), respectively (CIFAR-10, bird class). The trend for low contrast and more defined features is maintained across different model architectures (SSIM of $0.552$ and BD of $0.698$).
  • Figure 3: Comparison between different selection methods under normal and DP training (ResNet-18, CIFAR-10). Higher is better.
  • Figure 4: Comparison of images with largest gradient norms for DP and non-DP models, respectively (ResNet-18, CIFAR-10, bird class, $\varepsilon = 4$). There is little conceptual similarity between the chosen images (low correlation coefficients at different $\varepsilon$ values, SSIM of $0.355$ and BD of $0.914$).
  • Figure 5: Comparison of images with largest VoGs for ResNet-18 and ResNeXt-101, respectively (non-private models, PPPD). Here there is some variation in the images (SSIM of $0.521$ and BD of $0.320$) based on the size of the model, even in non-private settings.
  • ...and 5 more figures

Theorems & Definitions (1)

  • Definition 1: $(\varepsilon, \delta)$-DP [dpbook]
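
For reference, the standard textbook formulation of $(\varepsilon, \delta)$-DP is given below. It is assumed, not confirmed by this section, that the paper's Definition 1 matches it; the symbols $\mathcal{M}$, $D$, $D'$ and $S$ are notation chosen here. A randomised mechanism $\mathcal{M}$ satisfies $(\varepsilon, \delta)$-DP if, for all pairs of adjacent datasets $D$ and $D'$ (differing in one record) and all measurable sets of outputs $S$,

$$
\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta .
$$

Smaller $\varepsilon$ and $\delta$ correspond to a stricter privacy guarantee, which is the sense in which the abstract refers to "stricter privacy settings".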