Continuum-armed Bandit Optimization with Batch Pairwise Comparison Oracles

Xiangyu Chang, Xi Chen, Yining Wang, Zhiyi Zeng

TL;DR

This work studies continuum-armed bandit optimization with a biased pairwise comparison oracle, aiming to maximize an unknown smooth function $f$ over a horizon $T$ when direct evaluation of $f$ is unavailable. The authors introduce discretization and local polynomial regression to translate the problem into a linear-bandit paradigm, then deploy a novel tournament successive elimination framework with batched LinUCB updates to identify near-optimal regions, achieving regret optimal up to poly-log factors for Hölder-smooth objectives. They also address strongly concave objectives via proximal gradient methods with inexact gradient estimates, yielding near-minimax rates. The framework is applied to operations-management problems such as joint pricing/inventory replenishment and network revenue management, delivering improved regret bounds and practical performance in censored-demand settings and nonparametric-demand learning. Numerical experiments on synthetic smooth/concave objectives and inventory-censoring scenarios corroborate the theoretical guarantees and demonstrate improved efficiency over prior approaches.
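
The discretization step mentioned above can be illustrated with a minimal sketch. It assumes the domain is $[0,1]^d$, split into $J^d$ equal cells, with monomial features computed in rescaled local coordinates. The function names, the rescaling, and the `degree` parameter are hypothetical choices for illustration and are not taken from the paper.

```python
import itertools
import numpy as np

def cell_index(x, J):
    """Map a point in [0, 1]^d to the index of its cell on a J-per-axis grid."""
    x = np.asarray(x, dtype=float)
    idx = np.minimum((x * J).astype(int), J - 1)  # x = 1.0 falls in the last cell
    return tuple(int(v) for v in idx)

def local_features(x, J, degree):
    """Monomial features of x in the local coordinates of its cell.

    Assumption: local coordinates are rescaled to [0, 1]^d, and all monomials
    of total degree <= `degree` are used. A function that is well approximated
    by a polynomial on the cell is then roughly linear in these features.
    """
    x = np.asarray(x, dtype=float)
    corner = np.array(cell_index(x, J)) / J
    u = (x - corner) * J  # rescaled local coordinates
    exponents = [e for e in itertools.product(range(degree + 1), repeat=len(x))
                 if sum(e) <= degree]
    return np.array([np.prod(u ** np.array(e)) for e in exponents])

phi = local_features([0.3, 0.6], J=4, degree=2)
```

Within a single cell, the unknown function is approximated by an inner product between such a feature vector and a cell-specific coefficient vector, which is what makes linear-bandit tools applicable.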

Abstract

This paper studies a bandit optimization problem where the goal is to maximize a function $f(x)$ over $T$ periods for some unknown strongly concave function $f$. We consider a new pairwise comparison oracle, where the decision-maker chooses a pair of actions $(x, x')$ for a number of consecutive periods and then obtains an estimate of $f(x)-f(x')$. We show that such a pairwise comparison oracle has important applications in joint pricing and inventory replenishment problems and in network revenue management. The challenge in this bandit optimization problem is twofold. First, the decision-maker needs to determine not only a pair of actions $(x, x')$ but also a stopping time $n$ (i.e., the number of queries based on $(x, x')$). Second, motivated by our inventory application, the estimate of the difference $f(x)-f(x')$ is biased, which differs from existing oracles in the stochastic optimization literature. To address these challenges, we first introduce a discretization technique and local polynomial approximation to relate this problem to linear bandits. Then we develop a tournament successive elimination technique to localize the discretized cell and run an interactive batched version of the LinUCB algorithm on cells. We establish regret bounds that are optimal up to poly-logarithmic factors. Furthermore, we apply our proposed algorithm and analytical framework to the two operations management problems and obtain results that improve on the state of the art in the existing literature.
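
To make the oracle concrete, here is a minimal simulation sketch. The error model is an assumption chosen for illustration: Gaussian noise shrinking like $1/\sqrt{n}$ and a deterministic bias shrinking like $1/n$. The names `make_oracle`, `noise_scale`, and `bias_scale` are hypothetical; the paper's own oracle definition may differ.

```python
import numpy as np

def make_oracle(f, noise_scale=1.0, bias_scale=0.5, seed=0):
    """Build a simulated batch pairwise comparison oracle for a function f.

    Assumed error model (illustrative only): querying the pair (x, x_prime)
    for n consecutive periods returns f(x) - f(x_prime) plus a bias of order
    1/n and zero-mean noise with standard deviation of order 1/sqrt(n).
    """
    rng = np.random.default_rng(seed)

    def oracle(x, x_prime, n):
        if n < 1:
            raise ValueError("n must be a positive number of periods")
        true_gap = f(x) - f(x_prime)
        bias = bias_scale / n                              # assumed bias term
        noise = rng.normal(0.0, noise_scale / np.sqrt(n))  # assumed noise term
        return true_gap + bias + noise

    return oracle

# A strongly concave test objective with maximizer at x = 0.3.
f = lambda x: -(x - 0.3) ** 2
oracle = make_oracle(f)
est = oracle(0.3, 0.8, n=10_000)  # estimate of f(0.3) - f(0.8) = 0.25
```

The sketch shows the trade-off behind the stopping time $n$: a longer batch yields a more accurate comparison, but every period spent on a suboptimal pair adds to the regret.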

Paper Structure

This paper contains 49 sections, 13 theorems, 68 equations, 7 figures, 4 tables, 4 algorithms.

Key Result

Lemma 1

Let $f\in\Sigma_d(k,M)$. Then for any $\boldsymbol j\in[J]^d$, there exists $\theta_{\boldsymbol j}\in\mathbb R^{\nu}$, $\|\theta_{\boldsymbol j}\|_2\leq M\sqrt{\nu}$, such that

Figures (7)

  • Figure 1: Average regret of $f_1$
  • Figure 2: Average regret of $f_2$
  • Figure 3: Average regrets of $f_3$ with $d=2$.
  • Figure 4: Average regrets of $f_4$ with $d=2$.
  • Figure 5: Overall average regret under concave $G(\cdot)$ (left two panels), and non-concave $G(\cdot)$ (the rightmost panel)
  • ...and 2 more figures

Theorems & Definitions (26)

  • Definition 1
  • Example 1: The standard bandit feedback
  • Example 2: The continuous dueling bandit
  • Definition 2: The Hölder class
  • Definition 3: Strongly concave functions
  • Example 3: Network revenue management
  • Lemma 1
  • Lemma 2
  • Remark 1
  • Remark 2
  • ...and 16 more