Continuum-armed Bandit Optimization with Batch Pairwise Comparison Oracles

Xiangyu Chang; Xi Chen; Yining Wang; Zhiyi Zeng

TL;DR

This work studies continuum-armed bandit optimization with a biased pairwise comparison oracle, aiming to maximize an unknown smooth function $f$ over a horizon $T$ when direct evaluation of $f$ is unavailable. The authors introduce discretization and local polynomial regression to translate the problem into a linear-bandit paradigm, then deploy a novel tournament successive elimination framework with batched LinUCB updates to identify near-optimal regions, achieving regret optimal up to poly-log factors for Hölder-smooth objectives. They also address strongly concave objectives via proximal gradient methods with inexact gradient estimates, yielding near-minimax rates. The framework is applied to operations-management problems such as joint pricing/inventory replenishment and network revenue management, delivering improved regret bounds and practical performance in censored-demand settings and nonparametric-demand learning. Numerical experiments on synthetic smooth/concave objectives and inventory-censoring scenarios corroborate the theoretical guarantees and demonstrate improved efficiency over prior approaches.

Abstract

This paper studies a bandit optimization problem where the goal is to maximize a function $f(x)$ over $T$ periods for some unknown strongly concave function $f$. We consider a new pairwise comparison oracle, where the decision-maker chooses a pair of actions $(x, x')$ for a number of consecutive periods and then obtains an estimate of $f(x)-f(x')$. We show that such a pairwise comparison oracle has important applications in joint pricing and inventory replenishment problems and in network revenue management. The challenge in this bandit optimization problem is twofold. First, the decision-maker needs to determine not only a pair of actions $(x, x')$ but also a stopping time $n$ (i.e., the number of queries made with $(x, x')$). Second, motivated by our inventory application, the estimate of the difference $f(x)-f(x')$ is biased, unlike existing oracles in the stochastic optimization literature. To address these challenges, we first introduce a discretization technique and local polynomial approximation to relate this problem to linear bandits. We then develop a tournament successive elimination technique to localize the discretized cell and run an interactive batched version of the LinUCB algorithm on cells. We establish regret bounds that are optimal up to poly-logarithmic factors. Furthermore, we apply our proposed algorithm and analytical framework to the two operations management problems and obtain results that improve on the state of the art in the existing literature.
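To make the oracle concrete, the following is a minimal Python sketch of a batch pairwise comparison oracle and a plain single-elimination tournament over a grid of candidate actions. It is an illustration under stated assumptions, not the authors' algorithm: the bias form `bias_const / n`, the Gaussian noise with standard deviation `noise_sd / sqrt(n)`, and the names `make_oracle` and `tournament` are all hypothetical, and the tournament omits the local polynomial approximation and the batched LinUCB step described in the abstract.

```python
import math
import random


def make_oracle(f, bias_const=1.0, noise_sd=1.0, seed=0):
    """Build a toy batch pairwise comparison oracle for f.

    Assumption (not from the paper): committing to the pair (x, x_prime)
    for n consecutive periods returns f(x) - f(x_prime) plus a bias of
    bias_const / n and Gaussian noise of std noise_sd / sqrt(n).
    """
    rng = random.Random(seed)

    def oracle(x, x_prime, n):
        bias = bias_const / n
        noise = rng.gauss(0.0, noise_sd / math.sqrt(n))
        return f(x) - f(x_prime) + bias + noise

    return oracle


def tournament(oracle, candidates, n):
    """Single-elimination tournament over candidate actions.

    Each match spends n periods on one pair and keeps the action the
    oracle estimates to be better. Returns (winner, periods_used).
    """
    survivors = list(candidates)
    periods_used = 0
    while len(survivors) > 1:
        next_round = []
        for i in range(0, len(survivors) - 1, 2):
            a, b = survivors[i], survivors[i + 1]
            estimate = oracle(a, b, n)  # biased estimate of f(a) - f(b)
            periods_used += n
            next_round.append(a if estimate >= 0 else b)
        if len(survivors) % 2 == 1:
            # An unpaired action advances without a match.
            next_round.append(survivors[-1])
        survivors = next_round
    return survivors[0], periods_used


def f(x):
    # Hypothetical strongly concave objective with maximizer at 0.3.
    return -(x - 0.3) ** 2


grid = [i / 10 for i in range(11)]
noisy_oracle = make_oracle(f, bias_const=1.0, noise_sd=1.0, seed=0)
winner, periods = tournament(noisy_oracle, grid, n=1000)
print(winner, periods)
```

The sketch shows the trade-off behind the stopping time $n$: a larger $n$ shrinks both the assumed bias and the noise of each comparison, but every match consumes $n$ of the $T$ available periods.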
