Online Budget Allocation with Censored Semi-Bandit Feedback

François Bachoc, Nicolò Cesa-Bianchi, Tommaso Cesari, Roberto Colomboni

TL;DR

This work studies online budget allocation across $K$ tasks with stochastic rewards and censored semi-bandit feedback, where only completed-task rewards are observed. It introduces an optimistic, UCB-style algorithm that operates on the continuous simplex and leverages the known budget-to-success curves to achieve strong sample efficiency. In the diminishing-returns regime, the algorithm attains polylogarithmic regret without manual tuning, while for general nondecreasing curves it achieves a minimax-optimal rate of $\tilde O(K\sqrt{T})$, and a matching lower bound $Ω(K\sqrt{T})$ holds even with full feedback, showing intrinsic hardness outside diminishing returns. The paper further demonstrates that standard bandit techniques cannot reach these rates, and it discusses extensions to unknown curves, delays, and contextual variants, highlighting practical relevance to crowdsourcing, autobidding, and compute-resource allocation for large-scale models.

Abstract

We study a stochastic budget-allocation problem over $K$ tasks. At each round $t$, the learner chooses an allocation $X_t \in Δ_K$. Task $k$ succeeds with probability $F_k(X_{t,k})$, where $F_1,\dots,F_K$ are nondecreasing budget-to-success curves, and upon success yields a random reward with unknown mean $μ_k$. The learner observes which tasks succeed, and observes a task's reward only upon success (censored semi-bandit feedback). This model captures, for instance, splitting payments across crowdsourcing workers or distributing bids across simultaneous auctions, and subsumes stochastic multi-armed bandits and semi-bandits. We design an optimism-based algorithm that operates under censored semi-bandit feedback. Our main result shows that in diminishing-returns regimes, the regret of this algorithm scales polylogarithmically with the horizon $T$ without any ad hoc tuning. For general nondecreasing curves, we prove that the same algorithm (with the same tuning) achieves a worst-case regret upper bound of $\tilde O(K\sqrt{T})$. Finally, we establish a matching worst-case regret lower bound of $Ω(K\sqrt{T})$ that holds even for full-feedback algorithms, highlighting the intrinsic hardness of our problem outside diminishing returns.
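The feedback model described in the abstract can be illustrated with a minimal simulation sketch. It is not the paper's algorithm: it only plays one round of the protocol for a given allocation. The names `play_round` and `expected_reward`, the example curves, and the Bernoulli rewards are all hypothetical choices made for illustration, and the sketch assumes that $Δ_K$ is the probability simplex (nonnegative entries summing to 1).

```python
import math
import random
from typing import Callable, List, Optional, Sequence, Tuple

Curve = Callable[[float], float]            # budget -> success probability
Sampler = Callable[[random.Random], float]  # draws one reward for a task


def play_round(
    allocation: Sequence[float],
    curves: Sequence[Curve],
    samplers: Sequence[Sampler],
    rng: random.Random,
) -> Tuple[List[bool], List[Optional[float]]]:
    """Play one round and return censored semi-bandit feedback.

    Assumption: the allocation lies on the probability simplex.
    rewards[k] is None when task k fails, i.e. the reward is censored.
    """
    if any(x < 0 for x in allocation) or abs(sum(allocation) - 1.0) > 1e-9:
        raise ValueError("allocation must lie on the simplex")
    successes: List[bool] = []
    rewards: List[Optional[float]] = []
    for x, curve, sampler in zip(allocation, curves, samplers):
        succeeded = rng.random() < curve(x)  # success w.p. F_k(x_k)
        successes.append(succeeded)
        # The reward is drawn and revealed only upon success.
        rewards.append(sampler(rng) if succeeded else None)
    return successes, rewards


def expected_reward(
    allocation: Sequence[float],
    curves: Sequence[Curve],
    means: Sequence[float],
) -> float:
    """Expected total reward of an allocation: sum_k mu_k * F_k(x_k)."""
    return sum(mu * F(x) for x, F, mu in zip(allocation, curves, means))


if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative nondecreasing curves (assumptions, not from the paper).
    curves = [lambda x: math.sqrt(x), lambda x: x]
    means = [0.3, 0.8]
    # Bernoulli rewards with the means above (also an assumption).
    samplers = [
        lambda r, m=m: 1.0 if r.random() < m else 0.0 for m in means
    ]
    successes, rewards = play_round([0.5, 0.5], curves, samplers, rng)
    print(successes, rewards)
    print(expected_reward([0.5, 0.5], curves, means))
```

In this sketch the learner always sees the `successes` vector, while an entry of `rewards` is `None` for every failed task. This is what separates censored semi-bandit feedback from full feedback, where every reward would be revealed.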

Paper Structure

This paper contains 28 sections, 18 theorems, 211 equations.

Key Result

Theorem 1

Under ass:speed-up, for any time horizon $T \ge 2$, if we run algo:ucbowski with parameters $T$ and $\delta \coloneqq \frac{1}{(KT)^2}$, its regret satisfies where $c_K$ is a polynomial function of $K$ whose powers and coefficients depend on $a_1,\dots,a_K$ and $\mu_1,\dots,\mu_K$ (see prop:explicit:constant in Appendix s:appe-speed-ups for more details).

Theorems & Definitions (18)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Corollary 1
  • Theorem 4
  • Theorem 5: Unknown budget-to-success curves imply linear minimax regret
  • Lemma 6
  • Corollary 2
  • Corollary 3
  • Lemma 7
  • ...and 8 more