Incentives in Private Collaborative Machine Learning
Rachael Hwee Ling Sim, Yehong Zhang, Trong Nghia Hoang, Xinyi Xu, Bryan Kian Hsiang Low, Patrick Jaillet
TL;DR
This work addresses how to incentivize participation in collaborative ML when parties require privacy guarantees. It introduces a DP-aware valuation based on Bayesian surprise, under which stronger DP guarantees reduce a party's data value on average, and embeds this valuation in a ρ-Shapley fair reward scheme that preserves individual rationality and group welfare. To realize the target rewards, two reward-control mechanisms are proposed: adding DP-aware noise to the perturbed statistics, or tempering the likelihood to interpolate between the prior and the grand coalition's posterior. Of the two, tempering proves more stable and yields rewards closer to the grand coalition's posterior. Experiments on synthetic and real datasets show predictable privacy-valuation and privacy-reward trade-offs and highlight the practical benefit of likelihood tempering, which maintains model utility while respecting DP requirements.
Abstract
Collaborative machine learning involves training models on data from multiple parties but must incentivize their participation. Existing data valuation methods fairly value and reward each party based on shared data or model parameters but neglect the privacy risks involved. To address this, we introduce differential privacy (DP) as an incentive. Each party can select its required DP guarantee and perturb its sufficient statistic (SS) accordingly. The mediator values the perturbed SS by the Bayesian surprise it elicits about the model parameters. As our valuation function enforces a privacy-valuation trade-off, parties are deterred from selecting excessive DP guarantees that reduce the utility of the grand coalition's model. Finally, the mediator rewards each party with different posterior samples of the model parameters. Such rewards still satisfy existing incentives like fairness but additionally preserve DP and a high similarity to the grand coalition's posterior. We empirically demonstrate the effectiveness and practicality of our approach on synthetic and real-world datasets.
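As a rough illustration of the privacy-valuation trade-off and of likelihood tempering, the sketch below uses a toy conjugate Gaussian model. It is not the paper's implementation. The model (Gaussian data with a Gaussian prior on the mean), the function names, and all parameter values are assumptions made for this example. A larger noise variance stands in for a stronger DP guarantee, and the clipping needed to bound sensitivity for an actual DP guarantee is omitted.

```python
import math
import random


def posterior(obs, n, sigma2, tau2, noise_var, kappa=1.0):
    """Noise-aware posterior N(mean, var) of theta given a perturbed sum `obs`.

    Assumed toy model: x_i ~ N(theta, sigma2), prior theta ~ N(0, tau2),
    obs = sum(x_i) + N(0, noise_var). `kappa` in [0, 1] tempers the
    likelihood: kappa=0 returns the prior, kappa=1 the full posterior.
    """
    lik_var = n * sigma2 + noise_var  # variance of obs given theta
    precision = 1.0 / tau2 + kappa * n * n / lik_var
    var = 1.0 / precision
    mean = var * kappa * n * obs / lik_var
    return mean, var


def surprise(mean, var, tau2):
    """Bayesian surprise: KL( N(mean, var) || N(0, tau2) )."""
    return 0.5 * (math.log(tau2 / var) + (var + mean * mean) / tau2 - 1.0)


def average_surprise(noise_var, n=20, sigma2=1.0, tau2=4.0, trials=2000, seed=0):
    """Monte Carlo average of the surprise over simulated parties."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        theta = rng.gauss(0.0, math.sqrt(tau2))
        exact_sum = sum(rng.gauss(theta, math.sqrt(sigma2)) for _ in range(n))
        # Perturb the sufficient statistic before it leaves the party.
        obs = exact_sum + rng.gauss(0.0, math.sqrt(noise_var))
        mean, var = posterior(obs, n, sigma2, tau2, noise_var)
        total += surprise(mean, var, tau2)
    return total / trials
```

Comparing `average_surprise(0.0)` with `average_surprise(400.0)` should show the average valuation falling as the perturbation grows. Calling `posterior` with `kappa` between 0 and 1 moves the reward from the prior toward the full posterior, which is the interpolation that likelihood tempering relies on.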
