Lipschitz Bandits with Stochastic Delayed Feedback
Zhongxuan Liu, Yue Kang, Thomas C. M. Lee
TL;DR
This work studies continuum-armed (Lipschitz) bandits under stochastic delayed feedback and designs two complementary algorithms. For bounded delays, the Delayed Zooming algorithm preserves the delay-free regret rate up to an additive term scaling with the maximal delay, using a lazy-update mechanism that stabilizes confidence bounds. For unbounded delays, the DLPP method uses phased pruning and uniform sampling to accumulate reliable feedback. It achieves near-optimal regret with an additive term governed by delay quantiles, and a lower bound shows this is optimal up to logarithmic factors. Both algorithms attain sublinear regret in their respective delay regimes, and experiments support the theory. The results are relevant to continuum-armed decision problems where feedback is delayed or intermittently missing, such as hyperparameter tuning and dynamic pricing.
Abstract
The Lipschitz bandit problem extends stochastic bandits to a continuous action set defined over a metric space, where the expected reward function satisfies a Lipschitz condition. In this work, we introduce the problem of Lipschitz bandits with stochastic delayed feedback, in which rewards are observed only after a random delay rather than immediately. We consider both bounded and unbounded stochastic delays, and design algorithms that attain sublinear regret guarantees in each setting. For bounded delays, we propose a delay-aware zooming algorithm that retains the optimal performance of the delay-free setting up to an additional term that scales with the maximal delay $\tau_{\max}$. For unbounded delays, we propose a novel phased learning strategy that accumulates reliable feedback over carefully scheduled intervals, and establish a regret lower bound showing that our method is optimal up to logarithmic factors. Finally, we present experimental results to demonstrate the efficiency of our algorithms under various delay scenarios.
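To make the delayed-feedback setting concrete, the following is a minimal Python sketch of a toy environment. It is not either of the algorithms proposed here. The class name `DelayedLipschitzEnv`, the reward function, the uniform delay distribution, the Gaussian noise, and all parameter values are illustrative assumptions chosen only to show how a reward generated at one round may reach the learner several rounds later.

```python
import heapq
import random


class DelayedLipschitzEnv:
    """Toy 1-D Lipschitz bandit on [0, 1] with randomly delayed rewards.

    Hypothetical illustration only: the reward function and the delay
    distribution are assumptions, not taken from the paper.
    """

    def __init__(self, max_delay=5, noise=0.1, seed=0):
        self.rng = random.Random(seed)
        self.max_delay = max_delay  # plays the role of tau_max (bounded delays)
        self.noise = noise
        self.t = 0
        # Min-heap of (arrival_round, pull_round, arm, reward).
        self._pending = []

    @staticmethod
    def mean_reward(x):
        # 1-Lipschitz on [0, 1] under |x - y|, maximized at x = 0.5 (assumed).
        return 1.0 - abs(x - 0.5)

    def step(self, arm):
        """Pull `arm`, then return all (arm, reward) pairs that have arrived."""
        self.t += 1
        reward = self.mean_reward(arm) + self.rng.gauss(0.0, self.noise)
        # Assumed delay model: uniform on {0, ..., max_delay}.
        delay = self.rng.randint(0, self.max_delay)
        heapq.heappush(self._pending, (self.t + delay, self.t, arm, reward))

        arrived = []
        while self._pending and self._pending[0][0] <= self.t:
            _, _, a, r = heapq.heappop(self._pending)
            arrived.append((a, r))
        return arrived


# Usage: pull a fixed arm for 100 rounds and count the feedback received.
env = DelayedLipschitzEnv(max_delay=5, seed=0)
received = 0
for _ in range(100):
    received += len(env.step(0.3))
```

In this sketch a learner sees only the pairs returned by `step`, so at any round some of its past pulls are still unobserved. With delays bounded by `max_delay`, at most `max_delay` pulls can be outstanding after any round, which gives some intuition for an additive regret term that scales with the maximal delay.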
