Blocking Bandits

Soumya Basu, Rajat Sen, Sujay Sanghavi, Sanjay Shakkottai

TL;DR

Blocking Bandits extend stochastic multi-armed bandits by introducing per-arm blocking times, making future availability depend on past actions. The authors establish offline hardness by a reduction to PINWHEEL scheduling, and show an Oracle Greedy algorithm achieves a $(1-1/e-O(1/T))$-approximation when rewards are known. For unknown rewards, a UCB Greedy strategy yields logarithmic regret against the greedy baseline, aided by a novel free-exploration phenomenon induced by blocking. In the equal-delay case, the problem reduces to combinatorial semi-bandits, yielding matching lower bounds and validating the problem's intrinsic difficulty. Together, these results illuminate the fundamental trade-offs between scheduling feasibility, learning, and exploration in blocked-action bandit settings with practical implications for recommendations and resource scheduling.
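The Oracle Greedy rule mentioned above can be sketched in a few lines. This is only an illustration, not the authors' implementation: the function name `oracle_greedy`, the example means and delays, and the blocking convention (an arm with delay `d` played in slot `t` becomes available again in slot `t + d`) are all assumptions made here.

```python
def oracle_greedy(means, delays, horizon):
    """Play, in each slot, the available arm with the highest known mean reward.

    Assumed convention: an arm with delay d played at slot t is next
    available at slot t + d (so d = 1 means the arm is never blocked).
    """
    next_free = [0] * len(means)  # first slot at which each arm is available
    schedule = []
    for t in range(horizon):
        available = [i for i in range(len(means)) if next_free[i] <= t]
        if not available:
            schedule.append(None)  # every arm is blocked, so the slot stays idle
            continue
        arm = max(available, key=lambda i: means[i])
        schedule.append(arm)
        next_free[arm] = t + delays[arm]  # block the arm that was just played
    return schedule


# Hypothetical instance: three arms with known means and delays 3, 2, 1.
print(oracle_greedy([0.9, 0.5, 0.2], [3, 2, 1], 6))
```

On this small instance the best arm is blocked for two slots after each play, so the greedy schedule falls back to the lower-reward arms in between.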

Abstract

We consider a novel stochastic multi-armed bandit setting, where playing an arm makes it unavailable for a fixed number of time slots thereafter. This models situations where reusing an arm too often is undesirable (e.g. making the same product recommendation repeatedly) or infeasible (e.g. compute job scheduling on machines). We show that with prior knowledge of the rewards and delays of all the arms, the problem of optimizing cumulative reward does not admit any pseudo-polynomial time algorithm (in the number of arms) unless the randomized exponential time hypothesis is false, by mapping to the PINWHEEL scheduling problem. Subsequently, we show that a simple greedy algorithm that plays the available arm with the highest reward is asymptotically $(1-1/e)$-optimal. When the rewards are unknown, we design a UCB-based algorithm which is shown to have $c \log T + o(\log T)$ cumulative regret against the greedy algorithm, leveraging the free exploration of arms due to the unavailability. Finally, when all the delays are equal, the problem reduces to Combinatorial Semi-bandits, providing us with a lower bound of $c' \log T + \omega(\log T)$.
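A UCB-style variant of the greedy rule for unknown rewards might look like the sketch below. It is a minimal illustration under stated assumptions, not the paper's algorithm as specified: the name `ucb_greedy`, the Bernoulli reward model, the UCB1-style bonus $\sqrt{2 \log(t+1) / n_i}$, and the blocking convention (an arm with delay `d` played in slot `t` is available again in slot `t + d`) are all chosen here for illustration.

```python
import math
import random


def ucb_greedy(true_means, delays, horizon, seed=0):
    """Play, in each slot, the available arm with the highest UCB index.

    Assumptions (not taken from the source): Bernoulli rewards, a
    UCB1-style exploration bonus, and an arm with delay d played at
    slot t becoming available again at slot t + d.
    """
    rng = random.Random(seed)
    k = len(true_means)
    pulls = [0] * k          # number of times each arm has been played
    reward_sum = [0.0] * k   # total observed reward per arm
    next_free = [0] * k      # first slot at which each arm is available
    schedule = []

    def index(i, t):
        if pulls[i] == 0:
            return float("inf")  # force one initial play of every arm
        bonus = math.sqrt(2.0 * math.log(t + 1) / pulls[i])
        return reward_sum[i] / pulls[i] + bonus

    for t in range(horizon):
        available = [i for i in range(k) if next_free[i] <= t]
        if not available:
            schedule.append(None)  # all arms blocked, idle slot
            continue
        arm = max(available, key=lambda i: index(i, t))
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        pulls[arm] += 1
        reward_sum[arm] += reward
        next_free[arm] = t + delays[arm]
        schedule.append(arm)
    return schedule, pulls


# Hypothetical instance: the learner never sees true_means directly.
schedule, pulls = ucb_greedy([0.9, 0.5, 0.2], [3, 2, 1], 200)
print(pulls)
```

Because the best arm is blocked after each play, the lower-ranked arms are played in the intervening slots regardless of their indices, which is one way to picture the free exploration the abstract refers to.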

Paper Structure

This paper contains 23 sections, 11 theorems, 27 equations, 2 figures, 1 algorithm.

Key Result

Theorem 3.1

MAXREWARD is at least as hard as PINWHEEL SCHEDULING on dense instances.
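For context, PINWHEEL SCHEDULING asks whether tasks with integer delays $d_1, \dots, d_K$ can be scheduled one per slot so that task $i$ appears at least once in every $d_i$ consecutive slots; an instance is dense when $\sum_i 1/d_i = 1$. The sketch below only illustrates that target problem and does not reproduce the reduction. The helper names `is_dense` and `is_valid_pinwheel_cycle` and the example instances are hypothetical.

```python
from fractions import Fraction


def is_dense(delays):
    """Return True when the instance has density exactly 1."""
    return sum(Fraction(1, d) for d in delays) == 1


def is_valid_pinwheel_cycle(cycle, delays):
    """Check that repeating `cycle` forever places task i at least once
    in every window of delays[i] consecutive slots."""
    n = len(cycle)
    for task, d in enumerate(delays):
        slots = [t for t, a in enumerate(cycle) if a == task]
        if not slots:
            return False  # the task is never scheduled
        # Gaps between consecutive occurrences, wrapping around the cycle;
        # a single occurrence has a gap equal to the full cycle length.
        gaps = [
            (slots[(j + 1) % len(slots)] - slots[j]) % n or n
            for j in range(len(slots))
        ]
        if max(gaps) > d:
            return False
    return True


# Hypothetical dense instance: 1/2 + 1/4 + 1/4 = 1.
delays = [2, 4, 4]
print(is_dense(delays))
print(is_valid_pinwheel_cycle([0, 1, 0, 2], delays))
```

Verifying a given cycle is cheap; the hardness result concerns deciding whether any valid schedule exists.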

Figures (2)

  • Figure 1: Cumulative regret scales as logarithmic, constant, and negative linear with randomly initialized delays in the positive, identical, and negative panels, respectively. Scaling panel: scaling of regret with identical delays $K^*$.
  • Figure 2: Scaling of regret with $K^*$ in jokes recommendation with blocking.

Theorems & Definitions (22)

  • Theorem 3.1
  • Corollary 3.2
  • proof
  • Theorem 3.3
  • Proposition 3.4
  • proof
  • Proposition 3.5
  • proof
  • Theorem 4.1
  • proof: Proof Sketch of Theorem 4.1
  • ...and 12 more