Constrained Best Arm Identification in Grouped Bandits
Sahil Dharod, Malyala Preethi Sravani, Sakshi Heda, Sharayu Moharir
TL;DR
The paper addresses best feasible arm identification in a grouped-bandit setting where each arm comprises $M$ independent attributes and an arm is feasible only if the mean reward of every one of its attributes exceeds a threshold $\mu_{\text{TH}}$. It introduces CSS-LUCB, a confidence-set LUCB-style algorithm that adaptively samples arm-attribute pairs, using confidence intervals to prune infeasible arms and identify the best feasible arm. The authors prove a fundamental lower bound on sample complexity and show that CSS-LUCB achieves an upper bound that matches it up to logarithmic factors, with both bounds expressed in terms of the problem hardness index $H_{\text{id}}$. Simulations compare the approach against modified action-elimination methods. The results indicate near-optimal sample efficiency and practical improvements for safety- and feasibility-constrained best-arm identification in structured, multi-attribute arms, with potential applications in multi-service evaluation and other grouped-bandit problems.
Abstract
We study a grouped bandit setting where each arm comprises multiple independent sub-arms referred to as attributes. Each attribute of each arm has an independent stochastic reward. We impose the constraint that, for an arm to be deemed feasible, the mean reward of each of its attributes must exceed a specified threshold. The goal is to find, in the fixed-confidence setting, the feasible arm with the highest mean reward averaged across its attributes. We first characterize a fundamental limit on the performance of any policy. We then propose a near-optimal confidence-interval-based policy for this problem and provide analytical guarantees for it. Finally, we compare the proposed policy with two suitably modified versions of action elimination via simulations.
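To make the objective concrete, the following minimal sketch computes the target of the identification problem when the attribute means are known exactly. It illustrates only the problem definition, not the proposed policy, which must estimate these means from samples. The function name `best_feasible_arm`, the example values, and the threshold of 0.4 are hypothetical and chosen for illustration; the strict inequality reflects the reading that each attribute mean must exceed the threshold.

```python
from typing import Optional, Sequence


def best_feasible_arm(
    means: Sequence[Sequence[float]], mu_th: float
) -> Optional[int]:
    """Return the index of the best feasible arm, or None if none is feasible.

    means[i][j] is the (assumed known) mean reward of attribute j of arm i.
    An arm is feasible if every attribute mean strictly exceeds mu_th.
    Among feasible arms, the best one has the highest mean averaged
    across its attributes.
    """
    best_index: Optional[int] = None
    best_value = float("-inf")
    for i, attrs in enumerate(means):
        # Skip arms that violate the threshold on any attribute.
        if not all(m > mu_th for m in attrs):
            continue
        avg = sum(attrs) / len(attrs)
        if avg > best_value:
            best_index, best_value = i, avg
    return best_index


# Hypothetical instance with 3 arms and M = 3 attributes each.
example_means = [
    [0.9, 0.3, 0.8],  # highest average, but one attribute is below 0.4
    [0.6, 0.7, 0.5],  # feasible, average 0.6
    [0.5, 0.6, 0.5],  # feasible, lower average than arm 1
]
print(best_feasible_arm(example_means, mu_th=0.4))  # prints 1
```

In this instance the arm with the highest average is infeasible, so the answer is arm 1. A policy for this problem therefore has to resolve both which arms are feasible and which feasible arm has the highest average.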
