Towards Optimal Differentially Private Regret Bounds in Linear MDPs

Sharan Sahu

TL;DR

This paper addresses privacy-preserving regret minimization in episodic inhomogeneous linear MDPs under joint differential privacy, where transitions and rewards are linear in a feature map $\boldsymbol{\phi}(s,a)$. It introduces DP-LSVI-UCB++ by privatizing LSVI-UCB++ with Gaussian noise and GOE perturbations, and proves a regret bound of $\tilde{O}( d\sqrt{H^3K} + H^{15/4} d^{7/6} K^{1/2} / \epsilon )$ alongside $(\epsilon, \delta')$-JDP guarantees. The analysis leverages Bernstein-type martingale concentration and variance-aware weighting, enabling tighter regret bounds than previous private approaches. The work advances the privacy-utility trade-off for linear MDPs and motivates extensions to low-rank MDPs and adaptive noise schemes.

Abstract

We study regret minimization under privacy constraints in episodic inhomogeneous linear Markov Decision Processes (MDPs), motivated by the growing use of reinforcement learning (RL) in personalized decision-making systems that rely on sensitive user data. In this setting, both transition probabilities and reward functions are assumed to be linear in a feature mapping $\boldsymbol{\phi}(s, a)$, and we aim to ensure privacy through joint differential privacy (JDP), a relaxation of differential privacy suited to online learning. Prior work has established suboptimal regret bounds by privatizing the LSVI-UCB algorithm, which achieves $\widetilde{O}(\sqrt{d^3 H^4 K})$ regret in the non-private setting. Building on recent advances that improve this to near minimax optimal regret $\widetilde{O}(d\sqrt{H^{3}K})$ via LSVI-UCB++ with Bernstein-style bonuses, we design a new differentially private algorithm by privatizing LSVI-UCB++ and adapting techniques for variance-aware analysis from offline RL. Our algorithm achieves a regret bound of $\widetilde{O}(d \sqrt{H^3 K} + H^{15/4} d^{7/6} K^{1/2} / \epsilon)$, improving over previous private methods. Empirical results show that our algorithm retains near-optimal utility compared to non-private baselines, indicating that privacy can be achieved with minimal performance degradation in this setting.
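
As a rough reading of the stated bound, the two terms can be compared directly. The calculation below is only a sketch: it ignores the constants and logarithmic factors hidden by $\widetilde{O}(\cdot)$, and it compares upper bounds, so it says nothing about empirical performance.

$$
\frac{H^{15/4} d^{7/6} K^{1/2} / \epsilon}{d \sqrt{H^{3} K}}
= \frac{H^{15/4 - 3/2}\, d^{7/6 - 1}}{\epsilon}
= \frac{H^{9/4} d^{1/6}}{\epsilon}
$$

Both terms scale as $K^{1/2}$, so under these simplifications the ratio between the privacy term and the non-private term does not depend on the number of episodes $K$. It depends only on the horizon $H$, the feature dimension $d$, and the privacy parameter $\epsilon$.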

Paper Structure

This paper contains 14 sections, 37 theorems, 159 equations, 1 figure.

Key Result

Lemma 2.1

If a mechanism $A$ satisfies $\rho$-zCDP, then for any $\delta > 0$, $A$ satisfies $\left( \rho + 2\sqrt{\rho \log \left( 1/\delta \right)}, \delta \right)$-DP.
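
The conversion in Lemma 2.1 is simple to evaluate numerically. The following is a minimal sketch, not code from the paper: the function names `zcdp_to_dp` and `dp_target_to_zcdp` are hypothetical, and the second function assumes one wants the largest $\rho$ whose converted $\epsilon$ stays within a chosen target, obtained by solving the lemma's expression as a quadratic in $\sqrt{\rho}$.

```python
import math


def zcdp_to_dp(rho: float, delta: float) -> float:
    """Epsilon of the (epsilon, delta)-DP guarantee implied by rho-zCDP (Lemma 2.1)."""
    if rho < 0:
        raise ValueError("rho must be non-negative")
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))


def dp_target_to_zcdp(epsilon: float, delta: float) -> float:
    """Largest rho whose Lemma 2.1 conversion does not exceed a target epsilon.

    With L = log(1/delta) and x = sqrt(rho), the equation
    x^2 + 2*sqrt(L)*x - epsilon = 0 has the non-negative root
    x = sqrt(L + epsilon) - sqrt(L). The conversion is increasing in rho,
    so this root gives the largest admissible rho.
    """
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    log_term = math.log(1.0 / delta)
    root = math.sqrt(log_term + epsilon) - math.sqrt(log_term)
    return root * root


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    delta = 1e-5
    rho = dp_target_to_zcdp(epsilon=1.0, delta=delta)
    print(f"rho = {rho:.6f} converts to epsilon = {zcdp_to_dp(rho, delta):.6f}")
```

Going from a target $(\epsilon, \delta)$ back to $\rho$ is the direction usually needed when calibrating noise, since zCDP budgets compose additively (Lemma 2.2) before the final conversion to DP.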

Theorems & Definitions (62)

  • Definition 2.1: Differential Privacy [10.1007/11681878_14]
  • Definition 2.2: Joint Differential Privacy [kearns2015robust]
  • Definition 2.3: zCDP [DBLP:journals/corr/DworkR16, DBLP:journals/corr/BunS16]
  • Lemma 2.1: Converting zCDP to DP [DBLP:journals/corr/BunS16]
  • Lemma 2.2: Adaptive composition and Post processing of zCDP [DBLP:journals/corr/BunS16]
  • Definition 2.4: $l_{2}$-sensitivity
  • Lemma 2.3: Privacy guarantee of Gaussian mechanism [10.1561/0400000042, DBLP:journals/corr/BunS16]
  • Lemma 2.4: Billboard lemma [DBLP:journals/corr/HsuHRRW13]
  • Theorem 3.1: Privacy Guarantee
  • Proof of Theorem 3.1
  • ...and 52 more