Continuum-armed Bandit Optimization with Batch Pairwise Comparison Oracles
Xiangyu Chang, Xi Chen, Yining Wang, Zhiyi Zeng
TL;DR
This work studies continuum-armed bandit optimization with a biased pairwise comparison oracle, aiming to maximize an unknown smooth function $f$ over a horizon $T$ when direct evaluation of $f$ is unavailable. The authors introduce discretization and local polynomial regression to translate the problem into a linear-bandit paradigm, then deploy a novel tournament successive elimination framework with batched LinUCB updates to identify near-optimal regions, achieving regret optimal up to poly-log factors for Hölder-smooth objectives. They also address strongly concave objectives via proximal gradient methods with inexact gradient estimates, yielding near-minimax rates. The framework is applied to operations-management problems such as joint pricing/inventory replenishment and network revenue management, delivering improved regret bounds and practical performance in censored-demand settings and nonparametric-demand learning. Numerical experiments on synthetic smooth/concave objectives and inventory-censoring scenarios corroborate the theoretical guarantees and demonstrate improved efficiency over prior approaches.
Abstract
This paper studies a bandit optimization problem where the goal is to maximize a function $f(x)$ over $T$ periods for some unknown strongly concave function $f$. We consider a new pairwise comparison oracle, where the decision-maker chooses a pair of actions $(x, x')$ for a number of consecutive periods and then obtains an estimate of $f(x)-f(x')$. We show that such a pairwise comparison oracle has important applications in joint pricing and inventory replenishment problems and in network revenue management. The challenge in this bandit optimization problem is twofold. First, the decision-maker needs to determine not only a pair of actions $(x, x')$ but also a stopping time $n$ (i.e., the number of queries based on $(x, x')$). Second, motivated by our inventory application, the estimate of the difference $f(x)-f(x')$ is biased, which differs from existing oracles in the stochastic optimization literature. To address these challenges, we first introduce a discretization technique and local polynomial approximation to relate this problem to linear bandits. We then develop a tournament successive elimination technique to localize the discretized cell and run an interactive batched version of the LinUCB algorithm on cells. We establish regret bounds that are optimal up to poly-logarithmic factors. Furthermore, we apply our proposed algorithm and analytical framework to the two operations management problems and obtain results that improve on state-of-the-art results in the existing literature.
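To make the oracle interface concrete, the following is a minimal Python sketch of a batch pairwise comparison oracle. It is an illustration under stated assumptions, not the paper's construction: the name `make_oracle`, the Gaussian noise, and the bias term that shrinks as $1/n$ are all hypothetical choices, since the abstract does not specify the noise or bias model. The sketch only shows the shape of the interaction: the decision-maker supplies a pair $(x, x')$ and a batch length $n$, and receives a biased estimate of $f(x)-f(x')$.

```python
import random


def make_oracle(f, bias_scale=0.5, noise_sd=1.0, seed=0):
    """Build a hypothetical batch pairwise comparison oracle for f.

    Assumptions (not taken from the paper): per-period Gaussian noise
    and a systematic bias of size bias_scale / n.
    """
    rng = random.Random(seed)

    def oracle(x, x_prime, n):
        # Average n noisy per-period observations of f(x) - f(x').
        total = 0.0
        for _ in range(n):
            total += f(x) - f(x_prime) + rng.gauss(0.0, noise_sd)
        noisy_mean = total / n
        # Assumed bias model: an offset that shrinks as the batch grows.
        bias = bias_scale / n
        return noisy_mean + bias

    return oracle


def f(x):
    # Toy strongly concave objective, maximized at x = 0.3.
    return -(x - 0.3) ** 2


oracle = make_oracle(f)
# Compare x = 0.3 against x' = 0.8 over a batch of n = 400 periods.
estimate = oracle(0.3, 0.8, n=400)
```

In this toy model a longer batch reduces both the noise and the bias of the estimate but spends more periods on the pair $(x, x')$, which is the trade-off behind choosing the stopping time $n$.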
