Near-Optimal Algorithms for Differentially Private Online Learning in a Stochastic Environment

Bingshan Hu, Zhiming Huang, Nishant A. Mehta, Nidhi Hegde

TL;DR

This work investigates differential privacy in stochastic online learning under bandit and full-information feedback, establishing near-optimal regret guarantees. For private bandits, it introduces Anytime-Lazy-UCB and Lazy-DP-TS, which achieve the instance-dependent rate $O\left(\sum_{j:Δ_j>0}\frac{\ln T}{\min\{Δ_j,ε\}}\right)$. For the private full-information setting, it introduces RNM-FTNL, whose instance-dependent and minimax regret bounds match the corresponding lower bounds up to a $\log(T)$ factor. The paper proves these full-information lower bounds, $Ω\left(\frac{\log K}{\min\{Δ_{\min},ε\}}\right)$ and $Ω\left(\sqrt{T \log K} + \frac{\log K}{ε}\right)$, clarifying the cost of privacy and its interaction with problem structure. Experimental results validate the practical performance of the proposed methods and highlight the dominance of the private TS and RNM-based approaches in various regimes. Overall, the work advances private online learning by delivering anytime, near-optimal algorithms for both bandit and full-information settings and by identifying key gaps for future research.
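As a rough illustration of how $\min\{Δ_j, ε\}$ splits the bandit bound into a non-private regime ($ε \ge Δ_j$) and a privacy-dominated regime ($ε < Δ_j$), the hypothetical helpers below evaluate the stated orders with all constants dropped. The gaps, horizon, and $ε$ values are made-up inputs, and the outputs are orders of magnitude, not regret predictions.

```python
import math


def private_bandit_bound(gaps, T, eps):
    """Order of the private bandit bound, constants dropped:
    sum over suboptimal arms j of ln(T) / min(gap_j, eps)."""
    return sum(math.log(T) / min(g, eps) for g in gaps if g > 0)


def full_info_lower_bounds(K, T, eps, gap_min):
    """Orders of the two full-information lower bounds, constants dropped."""
    instance = math.log(K) / min(gap_min, eps)
    minimax = math.sqrt(T * math.log(K)) + math.log(K) / eps
    return instance, minimax


if __name__ == "__main__":
    gaps = [0.0, 0.1, 0.2, 0.5]  # arm 0 is optimal (gap 0), so it is skipped
    T = 10**5
    for eps in (1.0, 0.1, 0.01):
        print(f"eps={eps}: {private_bandit_bound(gaps, T, eps):.1f}")
```

With $ε = 1$ every gap is below $ε$ and the expression reduces to the familiar non-private $\sum_j \ln T / Δ_j$; with $ε = 0.01$ every suboptimal arm pays $\ln T / ε$ instead.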

Abstract

In this paper, we study differentially private online learning problems in a stochastic environment under both bandit and full information feedback. For differentially private stochastic bandits, we propose both UCB and Thompson Sampling-based algorithms that are anytime and achieve the optimal $O \left(\sum_{j: Δ_j>0} \frac{\ln(T)}{\min \left\{Δ_j, ε\right\}} \right)$ instance-dependent regret bound, where $T$ is the finite learning horizon, $Δ_j$ denotes the suboptimality gap between the optimal arm and a suboptimal arm $j$, and $ε$ is the required privacy parameter. For the differentially private full information setting with stochastic rewards, we show an $Ω\left(\frac{\ln(K)}{\min \left\{Δ_{\min}, ε\right\}} \right)$ instance-dependent regret lower bound and an $Ω\left(\sqrt{T\ln(K)} + \frac{\ln(K)}ε\right)$ minimax lower bound, where $K$ is the total number of actions and $Δ_{\min}$ denotes the minimum suboptimality gap among all the suboptimal actions. For the same differentially private full information setting, we also present an $ε$-differentially private algorithm whose instance-dependent regret and worst-case regret match our respective lower bounds up to an extra $\log(T)$ factor.
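The abstract does not describe how the full-information algorithm works; its name in the TL;DR, RNM-FTNL, suggests report noisy max (RNM) combined with follow-the-noisy-leader, but treat that reading as an assumption. The sketch below is one hypothetical way to combine the two ideas, assuming rewards in $[0,1]$, a leader that is updated only at the ends of doubling epochs, and Laplace noise of scale $2/ε$. It is not the paper's algorithm, and the function names and constants are illustrative.

```python
import numpy as np


def report_noisy_max(scores, eps, rng):
    """Index of the largest score after adding i.i.d. Laplace noise.

    Scale 2/eps makes this eps-DP when every score has sensitivity 1 and
    different scores may move in different directions (assumed here, since
    one round's reward vector can change several sums at once).
    """
    noisy = scores + rng.laplace(loc=0.0, scale=2.0 / eps, size=scores.shape)
    return int(np.argmax(noisy))


def lazy_noisy_leader(reward_matrix, eps, seed=0):
    """Hypothetical lazy follow-the-noisy-leader on a (T, K) reward matrix.

    Epoch e covers rounds [2^e - 1, 2^(e+1) - 1). The action played in an
    epoch comes from report noisy max on the reward sums of the previous
    epoch only, so each round's reward vector feeds exactly one noisy-max
    call; under this sketch's assumptions the action sequence is eps-DP.
    """
    rng = np.random.default_rng(seed)
    T, K = reward_matrix.shape
    actions = np.empty(T, dtype=int)
    leader = int(rng.integers(K))   # no data yet: arbitrary first action
    start, length = 0, 1
    while start < T:
        end = min(start + length, T)
        actions[start:end] = leader  # play the current leader all epoch
        # full information: every action's reward is observed each round
        epoch_sums = reward_matrix[start:end].sum(axis=0)
        leader = report_noisy_max(epoch_sums, eps, rng)
        start, length = end, 2 * length
    return actions


# Hypothetical usage: Bernoulli rewards for K = 3 actions.
env = np.random.default_rng(1)
means = np.array([0.5, 0.6, 0.9])
T = 20_000
rewards = env.binomial(1, means, size=(T, 3)).astype(float)
actions = lazy_noisy_leader(rewards, eps=0.5)
regret = np.cumsum(means.max() - means[actions])  # pseudo-regret
```

Using only the previous epoch's data keeps every reward inside a single noisy-max call, which trades some statistical efficiency for a simple privacy argument; reusing all past data would instead require composing the privacy loss over the epochs.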

Paper Structure

This paper contains 41 sections, 17 theorems, 73 equations, 2 figures, 3 algorithms.

Key Result

Theorem 2

The Optimal DP-UCB algorithm is $ε$-differentially private.
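The theorem states the privacy guarantee without the mechanism behind it. One standard way such a guarantee is obtained for UCB-style algorithms, sketched here under explicit assumptions, is lazy updating: each arm's rewards are grouped into blocks of doubling length, a single Laplace draw of scale $1/ε$ is added to each block sum, and the index uses only the most recently released block. With rewards in $[0,1]$, a block sum has sensitivity 1 and each reward enters exactly one noisy release, so the arm choices are $ε$-DP by parallel composition and post-processing. The class name, confidence radii, and constants are hypothetical choices, not the paper's Optimal DP-UCB or Anytime-Lazy-UCB; using $\ln t$ rather than $\ln T$ keeps the sketch anytime.

```python
import math

import numpy as np


class LazyLaplaceUCB:
    """Hypothetical anytime UCB with lazily released Laplace-noised means."""

    def __init__(self, K, eps, seed=0):
        self.K, self.eps = K, eps
        self.rng = np.random.default_rng(seed)
        self.block_len = [1] * K    # target size of each arm's current block
        self.block_sum = [0.0] * K  # rewards collected in the current block
        self.block_cnt = [0] * K    # pulls so far in the current block
        self.mu_hat = [None] * K    # last released noisy mean
        self.n_hat = [0] * K        # number of samples behind mu_hat
        self.t = 0

    def select(self):
        self.t += 1
        for j in range(self.K):
            if self.mu_hat[j] is None:  # no released estimate yet
                return j
        log_t = math.log(self.t)

        def index(j):
            n = self.n_hat[j]
            # illustrative radii: Hoeffding term plus a Laplace-tail term
            return (self.mu_hat[j] + math.sqrt(2 * log_t / n)
                    + 2 * log_t / (self.eps * n))

        return max(range(self.K), key=index)

    def update(self, j, reward):
        self.block_sum[j] += reward
        self.block_cnt[j] += 1
        if self.block_cnt[j] == self.block_len[j]:
            # one Laplace draw per block; each reward is used in one release
            noise = self.rng.laplace(0.0, 1.0 / self.eps)
            self.mu_hat[j] = (self.block_sum[j] + noise) / self.block_len[j]
            self.n_hat[j] = self.block_len[j]
            self.block_sum[j], self.block_cnt[j] = 0.0, 0
            self.block_len[j] *= 2


# Hypothetical usage: three Bernoulli arms, arm 2 is optimal.
means = [0.3, 0.5, 0.7]
env = np.random.default_rng(1)
agent = LazyLaplaceUCB(K=3, eps=1.0, seed=2)
T = 20_000
pulls = [0, 0, 0]
for _ in range(T):
    j = agent.select()
    agent.update(j, float(env.random() < means[j]))
    pulls[j] += 1
```

Because an arm's index changes only when one of its blocks completes, the number of noisy releases per arm is logarithmic in its pull count, which is what keeps the added noise from dominating the regret.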

Figures (2)

  • Figure 1: The cumulative regret for the first setting.
  • Figure 2: The cumulative regret for the second setting.

Theorems & Definitions (18)

  • Definition 1: Differential privacy in online learning
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Lemma 6
  • Lemma 7: Restatement of Lemma 2.9 in agrawal2017near
  • Theorem 8
  • Theorem 9
  • Corollary 10
  • ...and 8 more
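Definition 1 is listed above without its statement. For reference, the standard form of $ε$-differential privacy used in the private online learning literature is given below; the paper's exact wording (for example, how neighboring sequences are defined under bandit feedback) may differ, so treat this as an assumed form rather than a quotation.

```latex
% Standard epsilon-DP for online learning (assumed form; see lead-in)
Let $r_{1:T} = (r_1, \dots, r_T)$ and $r'_{1:T}$ be reward sequences that differ
in at most one round. An online algorithm $\mathcal{A}$ is
$\varepsilon$-differentially private if, for every such neighboring pair and
every set $S$ of action sequences,
\[
  \Pr\bigl[\mathcal{A}(r_{1:T}) \in S\bigr]
  \le e^{\varepsilon}\,
  \Pr\bigl[\mathcal{A}(r'_{1:T}) \in S\bigr].
\]
```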