Preserving Expert-Level Privacy in Offline Reinforcement Learning
Navodita Sharma, Vishnu Vinod, Abhradeep Thakurta, Alekh Agarwal, Borja Balle, Christoph Dann, Aravindan Raghuveer
TL;DR
The paper addresses protecting expert-level privacy in offline reinforcement learning, where data from multiple privacy-sensitive experts are used to learn shared policies. It proposes a two-stage consensus-based approach: first, privately identify stable trajectory prefixes via a modified Sparse Vector Technique to form $D^{\Pi}_{stable}$, which can train without added noise; second, apply DP-SGD on the remaining data to ensure expert-level privacy. The approach yields formal $(2\varepsilon', \delta_1/(2T))$-DP guarantees for the data-release step and overall $(\varepsilon, \delta)$ guarantees when combined with DP-SGD, while achieving superior empirical performance over a naive DP-SGD baseline on continuous-state RL benchmarks. This work enables practical privacy-preserving offline RL with gradient-based function approximators and broad applicability to real-world privacy-sensitive datasets.
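To make the two building blocks named above more concrete, the following is a minimal Python sketch of their textbook forms, not the paper's algorithm. `above_threshold` is the standard AboveThreshold variant of the Sparse Vector Technique (the paper uses a modified version whose details are not given here), and `dp_sgd_step` is a generic DP-SGD update in which one gradient row per expert is assumed as the unit of privacy. All function names, parameters, and the idea of feeding expert-disagreement counts as queries are illustrative assumptions.

```python
import numpy as np


def above_threshold(query_values, threshold, epsilon, sensitivity=1.0, rng=None):
    """Textbook AboveThreshold (Sparse Vector Technique).

    Returns the index of the first query whose noisy value reaches the
    noisy threshold, or None if no query does. Each query is assumed to
    change by at most `sensitivity` when one expert's data changes.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Noise the threshold once, then add fresh noise to every query.
    noisy_threshold = threshold + rng.laplace(scale=2.0 * sensitivity / epsilon)
    for i, q in enumerate(query_values):
        noisy_q = q + rng.laplace(scale=4.0 * sensitivity / epsilon)
        if noisy_q >= noisy_threshold:
            return i
    return None


def dp_sgd_step(params, per_expert_grads, clip_norm, noise_multiplier, lr, rng=None):
    """One generic DP-SGD update with per-expert clipping (an assumption).

    `per_expert_grads` has shape (num_experts, num_params): one aggregated
    gradient per expert, so clipping bounds each expert's contribution.
    """
    rng = np.random.default_rng() if rng is None else rng
    grads = np.asarray(per_expert_grads, dtype=float)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    # Scale each row down so its L2 norm is at most clip_norm.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = grads * scale
    # Gaussian noise calibrated to the clipping norm.
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=clipped.shape[1])
    noisy_mean = (clipped.sum(axis=0) + noise) / clipped.shape[0]
    return np.asarray(params, dtype=float) - lr * noisy_mean


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical queries: number of experts disagreeing with a prefix of
    # length t. The prefix would be cut at the first flagged position.
    disagreement_counts = [0, 1, 1, 2, 9, 12]
    cut = above_threshold(disagreement_counts, threshold=5, epsilon=1.0, rng=rng)
    print("first flagged prefix length index:", cut)

    params = np.zeros(2)
    grads = np.array([[3.0, 4.0], [0.1, -0.2], [1.0, 0.0]])
    print(dp_sgd_step(params, grads, clip_norm=1.0, noise_multiplier=1.1, lr=0.1, rng=rng))
```

In this sketch the privacy parameters are passed directly and no privacy accounting across steps is performed; a real pipeline would track the cumulative $(\varepsilon, \delta)$ budget with a dedicated accountant.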
Abstract
The offline reinforcement learning (RL) problem aims to learn an optimal policy from historical data collected by one or more behavioural policies (experts) by interacting with an environment. However, the individual experts may be privacy-sensitive in that the learnt policy may retain information about their precise choices. In some domains like personalized retrieval, advertising and healthcare, the expert choices are considered sensitive data. To provably protect the privacy of such experts, we propose a novel consensus-based expert-level differentially private offline RL training approach compatible with any existing offline RL algorithm. We prove rigorous differential privacy guarantees, while maintaining strong empirical performance. Unlike existing work in differentially private RL, we supplement the theory with proof-of-concept experiments on classic RL environments featuring large continuous state spaces, demonstrating substantial improvements over a natural baseline across multiple tasks.
