Differentially Private Reinforcement Learning with Self-Play

Dan Qiao; Yu-Xiang Wang

TL;DR

The paper extends the definitions of Joint DP (JDP) and Local DP (LDP) to two-player zero-sum episodic Markov Games and designs a provably efficient algorithm, based on optimistic Nash value iteration and privatized Bernstein-type bonuses, as a first step toward understanding trajectory-wise privacy protection in multi-agent RL.

Abstract

We study the problem of multi-agent reinforcement learning (multi-agent RL) with differential privacy (DP) constraints. This problem is well motivated by real-world applications involving sensitive data, where it is critical to protect users' private information. We first extend the definitions of Joint DP (JDP) and Local DP (LDP) to two-player zero-sum episodic Markov Games, where both definitions ensure trajectory-wise privacy protection. We then design a provably efficient algorithm based on optimistic Nash value iteration and privatization of Bernstein-type bonuses. The algorithm satisfies JDP and LDP requirements when instantiated with appropriate privacy mechanisms. Furthermore, for both notions of DP, our regret bound generalizes the best known result in the single-agent RL case, and it reduces to the best known result for multi-agent RL when there are no privacy constraints. To the best of our knowledge, this is the first line of results towards understanding trajectory-wise privacy protection in multi-agent RL.
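To make "instantiated with appropriate privacy mechanisms" more concrete, here is a minimal sketch of one generic way to privatize visitation counts: a one-shot Laplace mechanism. It is an illustration only, not the paper's Central or Local Privatizer, which must release counts repeatedly across episodes. The function names, the dictionary layout of the counts, and the problem sizes are all hypothetical. The noise scale assumes the replace-one-trajectory neighboring relation: a trajectory contributes exactly one count per step, so swapping one trajectory changes the count table by at most 2H in L1 norm.

```python
import itertools
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize_counts(counts, shape, epsilon, rng):
    """One-shot Laplace release of visitation counts N_h(s, a, b).

    counts : dict mapping (h, s, a, b) -> integer count (missing keys mean 0)
    shape  : (H, S, A, B), hypothetical problem sizes
    Assumption: replacing one trajectory changes at most 2H entries by 1,
    so the L1 sensitivity is 2H and the Laplace scale is 2H / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    horizon = shape[0]
    scale = 2.0 * horizon / epsilon
    noisy = {}
    # Noise is added to every cell, including zero counts, so that the
    # set of visited cells is not revealed by the output.
    for key in itertools.product(*(range(n) for n in shape)):
        noisy[key] = counts.get(key, 0) + laplace_noise(scale, rng)
    return noisy


# Hypothetical example: H=3 steps, S=4 states, A=2 and B=2 actions.
shape = (3, 4, 2, 2)
counts = {(0, 1, 0, 1): 5, (1, 2, 1, 0): 5, (2, 0, 0, 0): 5}
noisy_counts = privatize_counts(counts, shape, epsilon=1.0, rng=random.Random(0))
```

A learner would then compute transition estimates and bonuses from `noisy_counts` rather than from the raw counts, so everything downstream is post-processing of a private release. A smaller `epsilon` gives stronger privacy and noisier counts.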

Paper Structure (21 sections, 17 theorems, 71 equations, 1 table)

Introduction
Related work
Problem Setup
Markov Games and Regret
Differential Privacy in Multi-agent RL
Algorithm
Main results
Privatizers for JDP and LDP
Central Privatizer for Joint DP
Local Privatizer for Local DP
The post-processing step
Some discussions
Proof overview
Conclusion
Extended related works
...and 6 more sections

Key Result

Theorem 4.1

For any privacy budget $\epsilon>0$, any failure probability $\beta\in[0,1]$, and any Privatizer satisfying the paper's Privatizer assumption, with probability at least $1-\beta$ the regret of DP-Nash-VI (the paper's main algorithm) is bounded by [bound not reproduced here], where $K$ is the number of episodes and $T=HK$.

Theorems & Definitions (39)

Definition 2.1: Differential Privacy (DP)
Definition 2.2: Joint Differential Privacy (JDP)
Definition 2.3: Local Differential Privacy (LDP)
Remark 2.4
Remark 2.5
Remark 3.2
Theorem 4.1
Theorem 4.2
Lemma 5.1
Theorem 5.2: Results under JDP
...and 29 more
