Optimal Data Driven Resource Allocation under Multi-Armed Bandit Observations
Apostolos N. Burnetas, Odysseas Kanavetas, Michael N. Katehakis
TL;DR
This work studies constrained multi-armed bandits in which each activation consumes multiple resources that replenish at a constant rate, and feasibility must hold in every period. It derives an asymptotic lower bound on the regret of feasible uniformly fast policies and constructs block-based UCB policies that achieve this bound, with explicit forms for Normal rewards (unknown means, with known or unknown variances) and for discrete rewards with finite support. The proposed Z-UCB framework interleaves initial sampling with block LP-based exploration and exploitation, maintaining feasibility while asymptotically attaining the lower bound. The results advance constrained sequential decision-making, with practical implications for online revenue management and targeted advertising, by providing asymptotically optimal policies under several distributional assumptions.
Abstract
This paper introduces the first asymptotically optimal strategy for a multi-armed bandit (MAB) model under side constraints. The side constraints model situations in which bandit activations are limited by the availability of certain resources that are replenished at a constant rate. The main result is the derivation of an asymptotic lower bound for the regret of feasible uniformly fast policies, together with the construction of policies that achieve this lower bound under pertinent conditions. Further, we provide the explicit form of such policies for three cases: Normal distributions with unknown means and known variances, Normal distributions with unknown means and unknown variances, and arbitrary discrete distributions with finite support.
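To make the role of the resource constraint concrete, the following minimal sketch computes a complete-information benchmark for a simplified setting. It is not the paper's policy: it assumes a single resource, known mean rewards, and a per-period replenishment rate `budget`, and it finds activation frequencies `x` that maximize expected reward subject to average consumption not exceeding the replenishment rate. The function name `best_mixture` and all parameters are hypothetical. Because the feasible set is a simplex cut by one half-space, the sketch simply enumerates its vertices instead of calling an LP solver.

```python
from itertools import combinations


def best_mixture(means, costs, budget):
    """Maximize sum(mu_i * x_i) subject to sum(x_i) = 1, x >= 0,
    and sum(c_i * x_i) <= budget (one resource, known means: assumptions).

    Returns (value, x), or (None, None) if no mixture is feasible.
    """
    n = len(means)
    best_value, best_x = None, None

    def consider(x):
        # Keep the candidate vertex if it improves the expected reward.
        nonlocal best_value, best_x
        value = sum(m * xi for m, xi in zip(means, x))
        if best_value is None or value > best_value:
            best_value, best_x = value, x

    # Vertices that play a single arm whose cost fits within the budget.
    for i in range(n):
        if costs[i] <= budget:
            x = [0.0] * n
            x[i] = 1.0
            consider(x)

    # Vertices that mix one cheap and one expensive arm so the budget binds.
    for i, k in combinations(range(n), 2):
        lo, hi = (i, k) if costs[i] < costs[k] else (k, i)
        if costs[lo] < budget < costs[hi]:
            w = (budget - costs[lo]) / (costs[hi] - costs[lo])
            x = [0.0] * n
            x[lo], x[hi] = 1.0 - w, w
            consider(x)

    return best_value, best_x


# Illustrative numbers only: arm 1 pays more but costs more than the budget.
value, x = best_mixture(means=[1.0, 3.0], costs=[1.0, 4.0], budget=2.0)
```

In this example the expensive arm cannot be played every period, so the benchmark mixes the two arms until the resource constraint binds. With unknown means, a learning policy would have to estimate such a mixture from data while remaining feasible, which is the setting the abstract addresses.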
