Preserving Expert-Level Privacy in Offline Reinforcement Learning

Navodita Sharma, Vishnu Vinod, Abhradeep Thakurta, Alekh Agarwal, Borja Balle, Christoph Dann, Aravindan Raghuveer

TL;DR

The paper addresses expert-level privacy in offline reinforcement learning, where data from multiple privacy-sensitive experts are used to learn shared policies. It proposes a two-stage consensus-based approach: first, privately identify stable trajectory prefixes via a modified Sparse Vector Technique to form $D^{\Pi}_{stable}$, which can be used for training without added noise; second, apply DP-SGD on the remaining data to ensure expert-level privacy. The approach yields formal $(2\varepsilon', \delta_1/(2T))$-DP guarantees for the data-release step and overall $(\varepsilon, \delta)$-DP guarantees when combined with DP-SGD, while achieving superior empirical performance over a naive DP-SGD baseline on continuous-state RL benchmarks. This work enables practical privacy-preserving offline RL with gradient-based function approximators and broad applicability to real-world privacy-sensitive datasets.
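The first stage can be illustrated with a minimal sketch of a Sparse-Vector-style prefix test. This is not the paper's data-release algorithm: it assumes each expert contributes one trajectory of hashable tokens, and the function names, the threshold, and the Laplace scales (borrowed from the textbook AboveThreshold mechanism) are hypothetical choices made only for illustration, with no claim about the paper's privacy accounting.

```python
import random


def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def count_experts_with_prefix(trajectories, prefix):
    """Number of experts whose (single) trajectory starts with `prefix`."""
    k = len(prefix)
    return sum(1 for traj in trajectories if tuple(traj[:k]) == tuple(prefix))


def split_stable_prefix(trajectory, trajectories, threshold, eps_prime, rng):
    """Split one trajectory into (stable_prefix, unstable_remainder).

    Sparse-Vector-style loop: perturb the threshold once, then compare a
    noisy expert count for each successively longer prefix against it and
    stop at the first prefix whose noisy count falls below the threshold.
    The noise scales are illustrative assumptions, not the paper's values.
    """
    noisy_threshold = threshold + laplace(2.0 / eps_prime, rng)
    stable_len = 0
    for k in range(1, len(trajectory) + 1):
        count = count_experts_with_prefix(trajectories, trajectory[:k])
        noisy_count = count + laplace(4.0 / eps_prime, rng)
        if noisy_count < noisy_threshold:
            break  # consensus lost: everything from step k on is unstable
        stable_len = k
    return trajectory[:stable_len], trajectory[stable_len:]
```

In this sketch, prefixes shared by many experts pass the noisy test and would go to the stable set, while the remainder would be handed to DP-SGD in the second stage.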

Abstract

The offline reinforcement learning (RL) problem aims to learn an optimal policy from historical data collected by one or more behavioural policies (experts) by interacting with an environment. However, the individual experts may be privacy-sensitive in that the learnt policy may retain information about their precise choices. In some domains like personalized retrieval, advertising and healthcare, the expert choices are considered sensitive data. To provably protect the privacy of such experts, we propose a novel consensus-based expert-level differentially private offline RL training approach compatible with any existing offline RL algorithm. We prove rigorous differential privacy guarantees, while maintaining strong empirical performance. Unlike existing work in differentially private RL, we supplement the theory with proof-of-concept experiments on classic RL environments featuring large continuous state spaces, demonstrating substantial improvements over a natural baseline across multiple tasks.

Paper Structure

This paper contains 23 sections, 13 theorems, 47 equations, 5 figures, 3 tables, 4 algorithms.

Key Result

Lemma 5.0

For a trajectory prefix $\tau_k$, and neighbouring expert sets $\Pi, \Pi'$, if $count_{\tau_k}(\Pi) \geq c_{min}$ then, where $c_{min}$ and $\varepsilon'$ are as defined in Algorithm \ref{alg:data_release}.

Figures (5)

  • Figure 1: Training pipeline for Expert-level Differentially Private Offline RL given an expert set $\Pi = \{\pi_1, \pi_2, \dots, \pi_m\}$ and input trajectories logged into an offline dataset $D_\Pi$. $\mathcal{A}_{DR}$ (Algorithm \ref{alg:data_release}) splits input trajectories and adds stable prefixes to $D^\Pi_{stable}$, discarding the rest to $D^\Pi_{unst}$. These are used to train any off-the-shelf offline RL algorithm (see Algorithm \ref{alg:selective_dpsgd}) and learn an expert-level differentially private policy $\pi_{private}$.
  • Figure 2: Expert return histograms, with Kernel Density Estimation, for experts trained on heterogeneous (left to right) LunarLander, Acrobot, CartPole and HIV Treatment environments. We use experts with wide ranges of performance on the test environment. Return values for the HIV Treatment environment are normalised by $10^6$.
  • Figure 3: Performance of our method and DP-SGD for different values of $\varepsilon$ with $\delta = 1/m$, where $m$ is the number of experts. We report the episodic return, normalized between $0$ (random policy) and $1$ (optimal policy) averaged over $10$ evaluation runs (with $95\%$ confidence intervals) at the end of training, as a fraction of the non-private baseline. Our method consistently outperforms DP-SGD, especially in the high $\varepsilon$ regions.
  • Figure 4: Performance of our method and DP-SGD for $m=1000, 2000$ and $3000$ experts, on the Acrobot environment with $\varepsilon = 10.0$ and CQL as the underlying offline RL algorithm. We observe increasing improvements of our method over DP-SGD as $m$ increases.
  • Figure 5: Outline of the data generation scheme. We train experts on multiple environments as described below.
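The return normalization mentioned in the Figure 3 caption can be sketched as follows. This is one plausible reading of the caption, not the authors' evaluation code: the function names are hypothetical, and the random-policy and optimal-policy reference returns are assumed to be known constants for each environment.

```python
from statistics import mean


def normalize_return(raw_return, random_return, optimal_return):
    """Map a raw episodic return to 0 (random policy) .. 1 (optimal policy)."""
    return (raw_return - random_return) / (optimal_return - random_return)


def fraction_of_baseline(private_returns, baseline_returns,
                         random_return, optimal_return):
    """Mean normalized return of the private policy, expressed as a
    fraction of the mean normalized return of the non-private baseline."""
    private = mean(
        normalize_return(r, random_return, optimal_return)
        for r in private_returns
    )
    baseline = mean(
        normalize_return(r, random_return, optimal_return)
        for r in baseline_returns
    )
    return private / baseline
```

For example, with a random-policy return of 10 and an optimal return of 110, a private policy averaging 60 against a non-private baseline averaging 110 would score 0.5 under this reading.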

Theorems & Definitions (25)

  • Definition 3.1: Offline RL
  • Definition 3.2: $(\varepsilon,\delta)$-DP
  • Definition 3.3: Expert- and Trajectory-level privacy
  • Lemma 5.0
  • Lemma 5.0
  • Theorem 5.1
  • Lemma 5.1
  • Theorem 5.2
  • proof
  • Theorem 5.3
  • ...and 15 more
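
For orientation, the standard form of $(\varepsilon,\delta)$-differential privacy, which Definition 3.2 presumably follows, is written out here. The mechanism symbol $\mathcal{M}$ and the output set $S$ are notation assumed for this note, and the paper's exact wording of Definitions 3.2 and 3.3 is not reproduced in this summary. A randomized mechanism $\mathcal{M}$ is $(\varepsilon,\delta)$-DP if, for all neighbouring inputs $\Pi, \Pi'$ and all measurable output sets $S$,

$$\Pr[\mathcal{M}(\Pi) \in S] \leq e^{\varepsilon} \, \Pr[\mathcal{M}(\Pi') \in S] + \delta.$$

For expert-level privacy, neighbouring expert sets are taken here to differ in a single expert (an assumption about Definition 3.3), so the guarantee covers all trajectories contributed by that expert rather than a single trajectory.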