Differentially Private Deep Model-Based Reinforcement Learning

Alexandre Rio, Merwan Barlier, Igor Colin, Albert Thomas

TL;DR

This work introduces PriMORL, a model-based RL algorithm with formal differential privacy guarantees that enables the training of private RL agents on offline continuous control tasks with deep function approximations, whereas current methods are limited to simpler tabular and linear Markov Decision Processes (MDPs).

Abstract

We address private deep offline reinforcement learning (RL), where the goal is to train a policy on standard control tasks that is differentially private (DP) with respect to individual trajectories in the dataset. To achieve this, we introduce PriMORL, a model-based RL algorithm with formal differential privacy guarantees. PriMORL first learns an ensemble of trajectory-level DP models of the environment from offline data. It then optimizes a policy on the penalized private model, without any further interaction with the system or access to the dataset. In addition to offering strong theoretical foundations, we demonstrate empirically that PriMORL enables the training of private RL agents on offline continuous control tasks with deep function approximations, whereas current methods are limited to simpler tabular and linear Markov Decision Processes (MDPs). We furthermore outline the trade-offs involved in achieving privacy in this setting.
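As a rough illustration of what "trajectory-level DP" model training can look like, the sketch below performs one noisy gradient step in which each trajectory's whole contribution is clipped before Gaussian noise is added. It is a minimal sketch under stated assumptions, not the authors' implementation: the linear dynamics model, the function names (`clip_by_norm`, `trajectory_gradient`, `private_model_step`) and the parameters `clip_norm` and `lr` are hypothetical. Only the noise multiplier $z$ and the sampling ratio $q$ correspond to quantities named in the paper.

```python
import numpy as np


def clip_by_norm(vec, max_norm):
    """Scale vec down so its Euclidean (Frobenius) norm is at most max_norm."""
    norm = np.linalg.norm(vec)
    return vec * min(1.0, max_norm / (norm + 1e-12))


def trajectory_gradient(weights, trajectory):
    """Gradient of the mean squared next-state error over ONE trajectory.

    Assumption: a linear model s' ~ weights @ [s; a], used only to keep
    the example short. inputs has shape (H, d_in), targets (H, d_out).
    """
    inputs, targets = trajectory
    residual = inputs @ weights.T - targets          # (H, d_out)
    return 2.0 * residual.T @ inputs / len(inputs)   # (d_out, d_in)


def private_model_step(weights, trajectories, q, clip_norm, z, lr, rng):
    """One noisy update where the privacy unit is a whole trajectory."""
    # Include each trajectory independently with probability q.
    sampled = [t for t in trajectories if rng.random() < q]

    # Clip per trajectory, so no single trajectory can move the sum
    # by more than clip_norm.
    total = np.zeros_like(weights)
    for traj in sampled:
        total += clip_by_norm(trajectory_gradient(weights, traj), clip_norm)

    # Gaussian noise calibrated to the clipping bound.
    noise = rng.normal(0.0, z * clip_norm, size=weights.shape)

    # Normalize by the expected number of sampled trajectories.
    noisy_mean = (total + noise) / (q * len(trajectories))
    return weights - lr * noisy_mean
```

The sketch does not compute the resulting privacy budget $\epsilon$; that would come from a privacy accountant applied to $z$, $q$, the number of iterations $T$ and $\delta$. An ensemble would repeat this training for several models, each of which consumes part of the overall budget.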

Paper Structure

This paper contains 38 sections, 8 theorems, 21 equations, 7 figures, 6 tables, 4 algorithms.

Key Result

Theorem 4.2

$(\epsilon, \delta)$-TDP guarantees for dynamics model. Given $\delta \in (0,1)$, noise multiplier $z$, sampling ratio $q$ and number of training iterations $T$, let $\epsilon := \epsilon^{\text{MA}}\left(z, q, T, \delta\right)$ be the privacy budget computed by the moments accounting method from Ab

Figures (7)

  • Figure 1: PriMORL with its two main components: 1) private model training and 2) MBPO.
  • Figure 2: Learning curves on Pendulum (left), Balance (middle) and Swingup (right).
  • Figure 3: Comparison of policy performance with $u_\text{MA}$ and $u_\text{MPD}$ for a fixed model. We measure the average performance of the policy over the last 10 epochs of training. Average and confidence intervals are computed over 5 random seeds.
  • Figure 4: Policy performance on Pendulum as a function of the privacy budget $\epsilon$. We measure the average performance of the policy over the last 5 epochs of training. Average and confidence intervals are computed over 5 random seeds.
  • Figure 5: Learning curves for the SAC policy on HalfCheetah (right). Policy performance (episodic return) is evaluated in the true MDP at the end of each training epoch, over 10 evaluation episodes with different random seeds.
  • ...and 2 more figures

Theorems & Definitions (13)

  • Definition 3.1
  • Definition 4.1
  • Theorem 4.2
  • Proposition 4.2
  • Proposition 4.2
  • Theorem 4.3
  • Theorem A.1
  • proof
  • Theorem A.1
  • proof
  • ...and 3 more