Optimal Data Driven Resource Allocation under Multi-Armed Bandit Observations

Apostolos N. Burnetas, Odysseas Kanavetas, Michael N. Katehakis

TL;DR

This work tackles constrained multi-armed bandits where activations consume multiple resources that replenish at a constant rate, enforcing feasibility at every period. It derives an asymptotic lower bound on regret for uniformly fast policies and constructs block-based UCB policies that achieve this bound, including explicit forms for Normal (unknown means and possibly unknown variances) and finite-support discrete rewards. The proposed Z-UCB framework interleaves initial sampling with block LP-based exploration-exploitation, ensuring feasibility while asymptotically matching the theoretical lower limit. The results advance constrained sequential decision-making with practical implications for online revenue management and targeted advertising, by delivering provably optimal strategies under different distributional assumptions.

Abstract

This paper introduces the first asymptotically optimal strategy for a multi-armed bandit (MAB) model under side constraints. The side constraints model situations in which bandit activations are limited by the availability of certain resources that are replenished at a constant rate. The main result is the derivation of an asymptotic lower bound for the regret of feasible uniformly fast policies and the construction of policies that achieve this lower bound under pertinent conditions. Further, we provide the explicit form of such policies for three cases: Normal distributions with unknown means and known variances, Normal distributions with unknown means and unknown variances, and arbitrary discrete distributions with finite support.
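To make the role of the resource constraint concrete, the following is a minimal sketch of a full-information benchmark, not the paper's policy. It assumes a single resource (the paper allows several), known mean rewards, and a long-run average budget per period standing in for the constant replenishment rate. Under those assumptions the best randomized allocation solves a small linear program, and with one resource the optimum mixes at most two arms, so it can be found by enumeration. The function name `best_mix` and the parameters `means`, `costs`, and `budget` are hypothetical.

```python
from itertools import combinations


def best_mix(means, costs, budget):
    """Maximize sum(means[i] * x[i]) subject to
    sum(costs[i] * x[i]) <= budget, sum(x) == 1, x >= 0.

    Illustrative assumptions: one resource, known means, average budget.
    Returns (value, weights) with weights as {arm: probability},
    or None when no allocation is feasible.
    """
    best = None

    # Candidate 1: play a single arm whose cost fits within the budget.
    for i in range(len(means)):
        if costs[i] <= budget:
            if best is None or means[i] > best[0]:
                best = (means[i], {i: 1.0})

    # Candidate 2: mix two arms so that the budget is used exactly.
    for i, j in combinations(range(len(means)), 2):
        if costs[i] == costs[j]:
            continue  # equal costs cannot be mixed to hit a different budget
        w = (budget - costs[j]) / (costs[i] - costs[j])  # weight on arm i
        if 0.0 < w < 1.0:
            value = w * means[i] + (1.0 - w) * means[j]
            if best is None or value > best[0]:
                best = (value, {i: w, j: 1.0 - w})

    return best


# Hypothetical numbers: a cheap low-reward arm and a costly high-reward arm.
value, weights = best_mix(means=[1.0, 3.0], costs=[1.0, 4.0], budget=2.0)
print(value, weights)
```

In this toy instance neither arm alone is optimal: the costly arm exceeds the budget, and the cheap arm leaves budget unused, so the benchmark mixes the two. A learning policy must approach such a benchmark while estimating the means from observations.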

Paper Structure

This paper contains 16 sections, 7 theorems, 115 equations.

Key Result

Lemma 1

For any optimal matrix $B$ under $\underline{\underline{\theta}}$, such that for any $\underline{\theta}_{\alpha}^{'}\in\Delta\Theta_{\alpha}(\underline{\underline{\theta}})$ the following is true

Theorems & Definitions (7)

  • Lemma 1
  • Lemma 2
  • Proposition 3
  • Lemma 4
  • Theorem 5
  • Theorem 6
  • Lemma 7