Near-Optimal Algorithms for Differentially Private Online Learning in a Stochastic Environment
Bingshan Hu, Zhiming Huang, Nishant A. Mehta, Nidhi Hegde
TL;DR
This work studies differentially private online learning in a stochastic environment under both bandit and full-information feedback and establishes near-optimal regret guarantees. For private bandits, it introduces Anytime-Lazy-UCB and Lazy-DP-TS, both anytime algorithms, which achieve the instance-dependent rate $O\left(\sum_{j:Δ_j>0}\frac{\ln T}{\min\{Δ_j,ε\}}\right)$. For the private full-information setting, it proves an instance-dependent lower bound $Ω\left(\frac{\ln K}{\min\{Δ_{\min},ε\}}\right)$ and a minimax lower bound $Ω\left(\sqrt{T \ln K} + \frac{\ln K}{ε}\right)$, and introduces RNM-FTNL, whose instance-dependent and worst-case regret match these bounds up to an extra $\log T$ factor. Together, the bounds quantify the cost of privacy and how it interacts with the suboptimality gaps. Experiments support the practical performance of the proposed methods, with the private Thompson Sampling and RNM-based approaches performing best in several regimes. The paper also identifies remaining gaps for future research.
Abstract
In this paper, we study differentially private online learning problems in a stochastic environment under both bandit and full information feedback. For differentially private stochastic bandits, we propose both UCB and Thompson Sampling-based algorithms that are anytime and achieve the optimal $O\left(\sum_{j: Δ_j>0} \frac{\ln(T)}{\min\left\{Δ_j, ε\right\}}\right)$ instance-dependent regret bound, where $T$ is the finite learning horizon, $Δ_j$ denotes the suboptimality gap between the optimal arm and a suboptimal arm $j$, and $ε$ is the required privacy parameter. For the differentially private full information setting with stochastic rewards, we show an $Ω\left(\frac{\ln(K)}{\min\left\{Δ_{\min}, ε\right\}}\right)$ instance-dependent regret lower bound and an $Ω\left(\sqrt{T\ln(K)} + \frac{\ln(K)}{ε}\right)$ minimax lower bound, where $K$ is the total number of actions and $Δ_{\min}$ denotes the minimum suboptimality gap among all the suboptimal actions. For the same differentially private full information setting, we also present an $ε$-differentially private algorithm whose instance-dependent regret and worst-case regret match our respective lower bounds up to an extra $\log(T)$ factor.
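To make the role of $\min\{Δ_j, ε\}$ in the bandit bound concrete, the following minimal sketch evaluates the quantity inside the $O(\cdot)$ for a given set of gaps. It is only an illustration of the formula, not one of the paper's algorithms: the function name `private_bandit_rate` and the example gaps are hypothetical, and all constants hidden by the $O(\cdot)$ notation are ignored.

```python
import math


def private_bandit_rate(gaps, epsilon, horizon):
    """Evaluate sum over arms with gap > 0 of ln(T) / min(gap, epsilon).

    Hypothetical helper for illustration only: it computes the expression
    inside the O(.) bound and drops all constant factors.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    log_t = math.log(horizon)
    # Arms with zero gap are optimal and contribute nothing to the sum.
    return sum(log_t / min(gap, epsilon) for gap in gaps if gap > 0)


# Assumed example: one optimal arm (gap 0) and two suboptimal arms.
example_gaps = [0.0, 0.1, 0.5]
rate_private = private_bandit_rate(example_gaps, epsilon=0.2, horizon=1000)
# With a very large epsilon, min(gap, epsilon) == gap for every arm,
# so the expression reduces to the sum of ln(T) / gap.
rate_weak_privacy = private_bandit_rate(example_gaps, epsilon=10.0, horizon=1000)
```

In this example the arm with gap $0.5$ is charged $\ln(T)/ε$ rather than $\ln(T)/Δ_j$ because $ε = 0.2 < 0.5$, while the arm with gap $0.1$ is unaffected by the privacy parameter. Since $1/\min\{Δ_j, ε\} = \max\{1/Δ_j, 1/ε\}$, each term is within a factor of two of $\ln(T)/Δ_j + \ln(T)/ε$.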
