Differentially Private Reward Functions in Policy Synthesis for Markov Decision Processes

Alexander Benvenuti, Calvin Hawkins, Brandon Fallin, Bo Chen, Brendan Bialy, Miriam Dennis, Matthew Hale

TL;DR

This paper addresses the risk of reward-function leakage in multi-agent MDPs by introducing two DP mechanisms to privatize rewards: input perturbation, where each agent adds Gaussian noise to its own reward vector, and output perturbation, where noise is added to the joint reward after aggregation. It proves $(\epsilon,\delta)$-DP guarantees for both methods, with input perturbation offering superior performance and requiring less trust in the aggregator. The authors derive accuracy bounds for privatized rewards, quantify the cost of privacy on policy performance and computation, and provide design guidelines to preserve critical goal/avoid state-action pairs after privatization. Numerical simulations across three examples demonstrate that reasonably strong privacy (e.g., $\epsilon \approx 1.3$) incurs only modest decreases in total reward (around 5%) and negligible increases in computation time (≈0.016%), highlighting a favorable privacy-utility trade-off in practice.

Abstract

Markov decision processes often seek to maximize a reward function, but onlookers may infer reward functions by observing the states and actions of such systems, revealing sensitive information. Therefore, in this paper we introduce and compare two methods for privatizing reward functions in policy synthesis for multi-agent Markov decision processes, which generalize Markov decision processes. Reward functions are privatized using differential privacy, a statistical framework for protecting sensitive data. The methods we develop perturb either (1) each agent's individual reward function or (2) the joint reward function shared by all agents. We show that approach (1) provides better performance. We then develop a polynomial-time algorithm for the numerical computation of the performance loss due to privacy on a case-by-case basis. Next, using approach (1), we develop guidelines for selecting reward function values to preserve "goal" and "avoid" states while still remaining private, and we quantify the increase in computational complexity needed to compute policies from privatized rewards. Numerical simulations are performed on three classes of systems and they reveal a surprising compatibility with privacy: using reasonably strong privacy ($\epsilon=1.3$) on average induces as little as a $5\%$ decrease in total accumulated reward and a $0.016\%$ increase in computation time.
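The two approaches can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's algorithm: the joint reward is assumed to be the entrywise average of the agents' reward vectors, the noise scale uses a common Gaussian-mechanism calibration $\sigma = \frac{\Delta}{2\epsilon}\left(K_\delta + \sqrt{K_\delta^2 + 2\epsilon}\right)$ with $K_\delta = \mathcal{Q}^{-1}(\delta)$, and the function names and the `sensitivity` parameter are hypothetical.

```python
import math
import random
from statistics import NormalDist


def gaussian_sigma(epsilon: float, delta: float, sensitivity: float) -> float:
    """Noise scale for a Gaussian mechanism (assumed calibration, see lead-in)."""
    # delta = 0 is excluded here because Q^{-1}(0) is unbounded.
    if epsilon <= 0 or not (0 < delta < 0.5):
        raise ValueError("need epsilon > 0 and 0 < delta < 1/2")
    k_delta = NormalDist().inv_cdf(1.0 - delta)  # Q^{-1}(delta)
    return sensitivity / (2.0 * epsilon) * (
        k_delta + math.sqrt(k_delta**2 + 2.0 * epsilon)
    )


def average(rewards):
    """Entrywise average of the agents' reward vectors (assumed aggregation)."""
    n_agents = len(rewards)
    return [sum(column) / n_agents for column in zip(*rewards)]


def input_perturbation(rewards, sigma, rng):
    """Approach (1): each agent adds noise to its own reward before sharing it."""
    noisy = [[r + rng.gauss(0.0, sigma) for r in agent] for agent in rewards]
    return average(noisy)


def output_perturbation(rewards, sigma, rng):
    """Approach (2): the aggregator sees the raw rewards and adds noise afterwards."""
    return [r + rng.gauss(0.0, sigma) for r in average(rewards)]


if __name__ == "__main__":
    rng = random.Random(0)
    rewards = [[1.0, -1.0, 0.5], [0.0, 2.0, -0.5]]  # two agents, three entries each
    sigma = gaussian_sigma(epsilon=1.3, delta=0.01, sensitivity=1.0)
    print(input_perturbation(rewards, sigma, rng))
    print(output_perturbation(rewards, sigma, rng))
```

In this sketch the only structural difference is where the noise enters: before aggregation, so the aggregator never sees a raw reward, or after it. The noise scale appropriate to each approach depends on the paper's adjacency relation and sensitivity analysis, which are not reproduced here.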

Paper Structure (21 sections, 7 theorems, 17 equations, 11 figures, 1 table)

Key Result

Theorem 1

Given privacy parameters $\epsilon>0$, $\delta\in[0, \frac{1}{2})$, and adjacency parameter $b>0$, the mapping from $\{r^i\}_{i\in[N]}$ to $\{\pi^{*,i}\}_{i\in[N]}$ defined by Algorithm algo:input keeps each $r^i$ $(\epsilon,\delta)$-differentially private with respect to the adjacency relation in De
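
For reference, the standard meaning of $(\epsilon,\delta)$-differential privacy can be written as follows. The symbols $\mathcal{M}$ (a randomized mechanism), $r$ and $r'$ (any pair of adjacent reward functions), and $S$ (any measurable set of outputs) are notation assumed here for illustration. The adjacency relation itself is the one defined in the paper and is not reproduced.

$$\mathbb{P}\left[\mathcal{M}(r) \in S\right] \leq e^{\epsilon}\, \mathbb{P}\left[\mathcal{M}(r') \in S\right] + \delta$$

Smaller values of $\epsilon$ and $\delta$ make the output distributions for adjacent rewards harder to distinguish, which corresponds to stronger privacy.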

Figures (11)

  • Figure 1: The flow of information for (a) input perturbation and (b) output perturbation. In input perturbation, agent $i$ sends the aggregator its privatized reward function $\tilde{r}^i$ (shown by the lower arrow in (a)), while in output perturbation, agent $i$ sends the sensitive (non-privatized) reward function $r^i$ (shown by the lower arrow in (b)). By privatizing their rewards before sending them, agents using input perturbation have greater control of the strength of privacy used to protect their rewards, and they do not need to trust the aggregator with their sensitive reward.
  • Figure 2: Simulation of the bounds in (a) Theorem \ref{theorem:cheb} and (b) Corollary \ref{corollary:choose_epsilon} with $\delta = 0.01$, $nm = 8$, and $b = 1$. In (a), we see that Theorem \ref{theorem:cheb} captures the qualitative behavior of the empirically computed expected maximal error, while also providing tight bounds on the true maximal error. In (b), increasing privacy strength (decreasing $\epsilon$) from $\epsilon=3$ to $\epsilon = 2$ leads to negligible increases in maximum average error. Furthermore, the increase in maximum average error is small until $\epsilon\leq2$, indicating marginal performance losses until privacy is relatively strong.
  • Figure 3: The probability that the "goal" state remains the same after the implementation of privacy as a function of the difference between reward values. Due to the symmetry between $\Phi(y)$ and $\mathcal{Q}(y)$, this bound is identical for a decreasing gap between the smallest and second smallest rewards. This bound maintains the same qualitative behavior as the true value, and thus allows users to control the probability that their reward functions have the same "goal" and "avoid" state-action pairs after privacy is applied.
  • Figure 4: The expected added computational complexity of computing a policy on a privatized reward function with (a) $R_{\text{max}} = 1$ and (b) $R_{\text{max}} = 10$, along with $\delta = 0.1$, $b = 1$, $n = 16$, $m = 16$, $\eta = 10^{-8}$, $\gamma = 0.99$, and a range of $\epsilon$'s. In both cases, one reward is set to $R_{\text{max}}$ and all others are set to $-R_{\text{max}}$. The bound is accurate over the whole range of $\epsilon$ and maintains the qualitative behavior of the true values, indicating that Theorem \ref{thm:e_cost_of_privacy} provides accurate estimates for the increase in computation time that privacy induces. With a larger maximum absolute reward value, there is only a minor increase in computation time to compute a policy with the private reward compared to computing a policy with the sensitive reward. Even with strong privacy, such as $\epsilon = 1$, we observe less than a $10\%$ increase in computation time in practice, indicating that strong privacy protections can be provided without significantly increasing computation time.
  • Figure 5: Agent $i$'s MDP in Example 1. Each agent starts in state 0 and has only 2 states and 2 actions. Taking action a in any state keeps the agent in the same state with probability $p$ and moves it to the other state with probability $1-p$. Taking action b moves the agent to the other state with probability $p$ and keeps it in the same state with probability $1-p$.
  • ...and 6 more figures
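
The effect described in the Figure 3 caption can be illustrated with a simplified two-entry calculation. The sketch below is not the bound plotted in Figure 3: it considers only a pair of reward values separated by `gap`, each perturbed by independent zero-mean Gaussian noise with standard deviation `sigma`, and returns the probability that their order survives. The function name and parameters are hypothetical.

```python
import math
from statistics import NormalDist


def prob_order_preserved(gap: float, sigma: float) -> float:
    """P[r1 + w1 > r2 + w2] for r1 - r2 = gap and w1, w2 ~ N(0, sigma^2) i.i.d.

    The noise difference w2 - w1 is N(0, 2 * sigma^2), so the probability
    equals Phi(gap / (sqrt(2) * sigma)), with Phi the standard normal CDF.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return NormalDist().cdf(gap / (math.sqrt(2.0) * sigma))


if __name__ == "__main__":
    for gap in (0.0, 1.0, 2.0, 4.0):
        print(gap, prob_order_preserved(gap, sigma=1.0))
```

Even in this two-entry simplification, widening the gap between the largest and second-largest reward raises the chance that the largest entry stays largest after noise is added, which is the qualitative behavior the caption describes.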

Theorems & Definitions (10)

  • Remark 4
  • Theorem 1: Solution to Problem \ref{Prob:framework}
  • Remark 5
  • Theorem 2: Alternative Solution to Problem \ref{Prob:framework}
  • Remark 6
  • Theorem 3: Solution to Problem \ref{prob:accuracy}
  • Corollary 1
  • Theorem 4: Solution to Problem \ref{prob:cost_of_privacy}
  • Theorem 5: Solution to Problem \ref{prob:rewarddesign}
  • Theorem 6: Solution to Problem \ref{prob:cop2}