Differentially Private Policy Gradient

Alexandre Rio, Merwan Barlier, Igor Colin

TL;DR

This work tackles the privacy risks of reinforcement learning by introducing a practical, scalable framework for differentially private policy gradient with on-policy updates. By reframing DP noise as a mechanism to enforce trust-region constraints, it preserves key properties of non-private policy gradient and TRPO/PPO-like methods while providing trajectory- and joint-DP guarantees. The authors derive theoretical bounds on the trust regions of updates under DP noise (using distributions such as the non-central $\chi^2$) and present a practical algorithm that clips per-user gradients, aggregates them, and injects Gaussian noise to obtain $(\epsilon,\delta)$-DP. Empirical results on tabular and continuous control tasks, personalized dosing, and RLHF show favorable privacy-utility trade-offs, supporting private RL deployment in real-world scenarios and offering a path toward private RL for LLM alignment.
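
The clip, aggregate, and add-noise step described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function name `private_gradient`, its parameters, and the choice to scale the noise standard deviation as `noise_multiplier * clip_norm` (which assumes add-or-remove-one-user adjacency, so that the clipped sum has L2 sensitivity `clip_norm`) are all assumptions made for the example. Converting a noise multiplier into an $(\epsilon,\delta)$ budget requires a privacy accountant, which is omitted here.

```python
import numpy as np

def private_gradient(per_user_grads, clip_norm, noise_multiplier, rng):
    """Clip each user's gradient, sum them, add Gaussian noise, then average.

    Hypothetical helper for illustration only; names and noise scaling
    are assumptions, not taken from the paper.
    """
    grads = np.asarray(per_user_grads, dtype=float)  # shape (n_users, dim)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    # Shrink any gradient whose L2 norm exceeds clip_norm; leave others as is.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = grads * scale
    # Assumed sensitivity of the clipped sum: clip_norm (add/remove one user).
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grads.shape[1])
    return (clipped.sum(axis=0) + noise) / grads.shape[0]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = rng.normal(size=(32, 4))  # 32 users, 4 policy parameters
    print(private_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping bounds how much any single user can move the aggregate, which is what lets a fixed noise scale cover every user at once.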

Abstract

Motivated by the increasing deployment of reinforcement learning in the real world, involving a large consumption of personal data, we introduce a differentially private (DP) policy gradient algorithm. We show that, in this setting, the introduction of Differential Privacy can be reduced to the computation of appropriate trust regions, thus avoiding the sacrifice of theoretical properties of the DP-less methods. Therefore, we show that it is possible to find the right trade-off between privacy noise and trust-region size to obtain a performant differentially private policy gradient algorithm. We then outline its performance empirically on various benchmarks. Our results and the complexity of the tasks addressed represent a significant improvement over existing DP algorithms in online RL.

Paper Structure

This paper contains 49 sections, 7 theorems, 56 equations, 8 figures, 4 tables, 3 algorithms.

Key Result

Theorem 4.3

($(\epsilon, \delta)$-DP Personalized RL). If the update mechanism $\mathcal{M}_\pi:D \longrightarrow\mathcal{M}(\pi, D^\pi) = \pi^\prime$ is $(\epsilon, \delta)$-TDP, then Algorithm alg:generic_protocol is both $(\epsilon, \delta)$-TDP and $(\epsilon, \delta)$-JDP.
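
As a reminder of the generic guarantee that the theorem builds on: a randomized mechanism $\mathcal{M}$ is $(\epsilon, \delta)$-DP if, for all neighboring inputs $D, D^\prime$ and all measurable output sets $S$, the inequality below holds. This is the standard textbook statement only. The TDP and JDP notions used in the theorem are fixed by the paper's own definitions (which specify what counts as neighboring and which outputs are exposed), and those are not reproduced here.

$$\Pr[\mathcal{M}(D) \in S] \le e^{\epsilon}\,\Pr[\mathcal{M}(D^\prime) \in S] + \delta$$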

Figures (8)

  • Figure 1: Cumulative regret on Riverswim for $\epsilon=1.0$ (dashed line) and $\epsilon=5.0$ (solid line).
  • Figure 2: Asymptotic performance vs. privacy budget $\epsilon$ on CartPole (left) and Acrobot (right), in log scale.
  • Figure 3: Relationship between the noise multiplier $z$ ($x$-axis) and the privacy budget $\epsilon$ ($y$-axis) of Algorithm alg:dp_pg.
  • Figure 4: The Riverswim environment (taken from chowdhury_differentially_2021).
  • Figure 5: Cumulative regret on Riverswim for $\epsilon=1.0$ (dashed line) and $\epsilon=5.0$ (solid line).
  • ...and 3 more figures

Theorems & Definitions (14)

  • Definition 3.1
  • Definition 4.1
  • Definition 4.2
  • Theorem 4.3
  • Proposition 4.4
  • Proposition 4.5
  • Proposition 4.5
  • Proposition 4.6
  • Remark 5.1
  • Remark 5.2
  • ...and 4 more