The Fair Value of Data Under Heterogeneous Privacy Constraints in Federated Learning

Justin Kang; Ramtin Pedarsani; Kannan Ramchandran

TL;DR

This work addresses fair valuation and incentive design for data contributed under heterogeneous privacy constraints in federated learning. It introduces two axiomatic notions of fairness, one that treats the platform and users as coequal and one among users only, both rooted in Shapley-value-like decompositions and extended to privacy-aware utilities $U(\boldsymbol{\rho})$. Through a heterogeneous privacy framework and mean-estimation examples, the paper identifies three regimes of platform behavior as privacy sensitivity varies, and provides mechanism-design algorithms to compute Nash equilibria under fair payments. The results show how data quantity, privacy level, and heterogeneity jointly determine fair payments and platform strategies, offering a principled baseline for privacy-aware data markets and FL incentive design. The practical impact lies in guiding regulators and platforms toward transparent, fair, and efficient data acquisition policies under realistic privacy constraints.

Abstract

Modern data aggregation often involves a platform collecting data from a network of users with various privacy options. Platforms must solve the problem of how to allocate incentives to users to convince them to share their data. This paper puts forth an idea for a fair amount to compensate users for their data at a given privacy level, based on an axiomatic definition of fairness along the lines of the celebrated Shapley value. To the best of our knowledge, these are the first fairness concepts for data that explicitly consider privacy constraints. We also formulate a heterogeneous federated learning problem for the platform with privacy level options for users. By studying this problem, we investigate the amount of compensation users receive under fair allocations with different privacy levels, amounts of data, and degrees of heterogeneity. We also discuss what happens when the platform is forced to design fair incentives. Under certain conditions, we find that when privacy sensitivity is low, the platform will set incentives to ensure that it collects all the data with the lowest privacy options. When the privacy sensitivity is above a given threshold, the platform will provide no incentives to users. Between these two extremes, the platform will set the incentives so that some fraction of the users choose the higher privacy option and the rest choose the lower privacy option.
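Since the abstract builds on the Shapley value, a minimal sketch of the classical (privacy-agnostic) Shapley value may help fix ideas. This is not the paper's privacy-aware definition. The `quality` scores and the square-root `toy_utility` are hypothetical stand-ins chosen only for illustration.

```python
from itertools import combinations
from math import comb


def shapley_values(players, utility):
    """Classical Shapley value of each player for a set function `utility`."""
    n = len(players)
    values = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):  # k = size of a coalition that excludes player i
            # Weight |S|! (n - |S| - 1)! / n!  ==  1 / (n * C(n - 1, |S|))
            weight = 1.0 / (n * comb(n - 1, k))
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                # Marginal contribution of i when joining coalition s
                total += weight * (utility(s | {i}) - utility(s))
        values[i] = total
    return values


# Hypothetical per-user "data quality" scores (assumption, not from the paper).
quality = {"a": 1.0, "b": 1.0, "c": 4.0}


def toy_utility(coalition):
    """Made-up utility with diminishing returns in total quality."""
    return sum(quality[p] for p in coalition) ** 0.5


phi = shapley_values(list(quality), toy_utility)
for user, value in phi.items():
    print(f"{user}: {value:.4f}")
```

In this toy setting the values sum to the utility of the full coalition, and the two users with identical scores receive identical shares. These are two of the properties that Shapley-style axioms are designed to guarantee.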

Paper Structure (53 sections, 5 theorems, 81 equations, 9 figures, 2 tables, 1 algorithm)

Introduction
Related Work
Economics
Privacy
Optimal Data Acquisition
Fairness
Main Contributions
Notation
Problem Setting
Privacy Levels and Utility Functions
Differences from prior work
The Data Acquisition Problem
Model Limitations
Known sensitivity functions
Data-correlated sensitivity
...and 38 more sections

Key Result

Theorem 1

Let $\phi_p(z, \boldsymbol{\epsilon})$ and $\phi_i(z, \boldsymbol{\epsilon})$ satisfying axioms (A.i-iii) represent the portion of total utility awarded to the platform and each user $i$ from utility $U(z, \boldsymbol{\epsilon})$. Then they are unique and take the form:

Figures (9)

Figure 1: Depiction of interactions between platform and users. Users generate data with phones, cameras, vehicles, and drones. This data goes to the platform but requires some level of privacy. The platform uses this data to generate utility, often by using the data for learning tasks. In return, the platform may provide the users with payments in the form of access to services, discounts on products, or monetary compensation.
Figure 2: Users choose between three levels of privacy. If $\rho_i = 0$, users send no data to the platform. If $\rho_i = 1$, a user's model is securely combined with those of other users $j$ who also choose $\rho_j = 1$, and the platform receives only the combined model. If $\rho_i = 2$, users send their relevant information directly to the platform.
Figure 3: Users send their data $x_i$ and a privacy level $\rho_i$ to the central platform in exchange for payments $t_i(\rho_i;\boldsymbol{\rho}_{-i})$. The central platform extracts utility from the data at a given privacy level and optimizes incentives to maximize the difference between the utility and the sum of payments, $U(\boldsymbol{\rho}) - \sum_i t_i(\rho_i;\boldsymbol{\rho}_{-i})$.
Figure 4: Each user $i \in [N]$ has mean and variance $(\theta_i, \sigma_i^2) \sim \Theta$, where $\Theta$ is a global joint distribution. Let $s^2 = \mathrm{Var}(\theta_i)$ and $r^2 = \mathbb{E}[\sigma_i^2]$ for all $i$. In this case $s^2$ is large relative to $r^2$, and the data is very heterogeneous.
Figure 5: (a) Plot of the difference from the average utility per user $U(\boldsymbol{\rho})/N$ for each of the four different types of users, for three different regimes of $s^2 = \mathrm{Var}(\theta_i)$ and $r^2 = \mathbb{E}[\sigma_i^2]$, with heterogeneity decreasing from left to right. In the left (most heterogeneous) plot, users who choose $\rho_i = 2$ are more valuable than those who choose $\rho_i = 1$. In the center there is an intermediate regime, where all users are paid closer to the average and users with more data are favored slightly. In the rightmost plot, with little heterogeneity, users with more data are paid more, and privacy level has a lesser impact on the payments. (b) In each case there is one user $i$ with $a_i = 100$ (indicated with a star), while all other users $j \neq i$ have $a_j = 1$ ($a_i$ represents the relative importance of the user in the utility function). In the two leftmost sets of bars, the user with $\rho_i = 2$ and $n_i = 100$ receives by far the largest payment when heterogeneity is high, but this becomes less dramatic as heterogeneity decreases. This shows that when users are very heterogeneous, if $a_i$ is large for only user $i$, most of the benefit in terms of additional payments should go to user $i$. Likewise, comparing the second-from-left and rightmost plots shows little difference, so the opposite is true in the homogeneous case: any user can benefit from any other user having a large $a_i$.
...and 4 more figures
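The hierarchical model in the Figure 4 caption can be sketched numerically. The version below assumes Gaussian distributions, a zero global mean, and a constant within-user variance $\sigma_i^2 = r^2$. None of these are stated in the caption, and the function name and parameter values are hypothetical. It compares how well each user's local sample mean and the pooled mean estimate that user's $\theta_i$ when $s^2$ is large relative to $r^2$.

```python
import random
import statistics


def sample_users(num_users, s2, r2, samples_per_user, seed=0):
    """Draw theta_i ~ N(0, s2), then each user's data x ~ N(theta_i, r2).

    Assumption: Gaussian draws and a fixed within-user variance r2.
    """
    rng = random.Random(seed)
    thetas = [rng.gauss(0.0, s2 ** 0.5) for _ in range(num_users)]
    data = [
        [rng.gauss(theta, r2 ** 0.5) for _ in range(samples_per_user)]
        for theta in thetas
    ]
    return thetas, data


# Highly heterogeneous regime: s^2 much larger than r^2 (illustrative values).
thetas, data = sample_users(num_users=2000, s2=4.0, r2=0.25, samples_per_user=20)

local_means = [statistics.fmean(x) for x in data]
s2_hat = statistics.pvariance(thetas)                         # spread of user means
r2_hat = statistics.fmean([statistics.variance(x) for x in data])  # within-user noise

# Error when estimating each theta_i from local data vs. from the pooled mean.
pooled = statistics.fmean(local_means)
local_mse = statistics.fmean([(m - t) ** 2 for m, t in zip(local_means, thetas)])
pooled_mse = statistics.fmean([(pooled - t) ** 2 for t in thetas])

print(f"s2_hat={s2_hat:.3f}, r2_hat={r2_hat:.3f}")
print(f"local MSE={local_mse:.4f}, pooled MSE={pooled_mse:.4f}")
```

Under these assumptions the pooled mean is a poor estimate of any individual $\theta_i$, which gives one intuition for why other users' data is worth less to a given user when heterogeneity is high.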

Theorems & Definitions (9)

Definition 1
Definition 2
Theorem 1
Theorem 2
Theorem 3
Proposition 4
proof
Proposition 5
proof
