Online Budget Allocation with Censored Semi-Bandit Feedback
François Bachoc, Nicolò Cesa-Bianchi, Tommaso Cesari, Roberto Colomboni
TL;DR
This work studies online budget allocation across $K$ tasks with stochastic rewards and censored semi-bandit feedback, where only completed-task rewards are observed. It introduces an optimistic, UCB-style algorithm that operates on the continuous simplex and leverages the known budget-to-success curves. In the diminishing-returns regime, the algorithm attains regret polylogarithmic in the horizon $T$ without ad hoc tuning. For general nondecreasing curves, the same algorithm achieves the minimax-optimal rate of $\tilde O(K\sqrt{T})$; a matching lower bound of $Ω(K\sqrt{T})$ holds even with full feedback, showing intrinsic hardness outside diminishing returns. The paper further demonstrates that standard bandit techniques cannot reach these rates, and it discusses extensions to unknown curves, delays, and contextual variants, highlighting practical relevance to crowdsourcing, autobidding, and compute-resource allocation for large-scale models.
Abstract
We study a stochastic budget-allocation problem over $K$ tasks. At each round $t$, the learner chooses an allocation $X_t \in Δ_K$. Task $k$ succeeds with probability $F_k(X_{t,k})$, where $F_1,\dots,F_K$ are nondecreasing budget-to-success curves, and upon success yields a random reward with unknown mean $μ_k$. The learner observes which tasks succeed, and observes a task's reward only upon success (censored semi-bandit feedback). This model captures, for instance, splitting payments across crowdsourcing workers or distributing bids across simultaneous auctions, and subsumes stochastic multi-armed bandits and semi-bandits. We design an optimism-based algorithm that operates under censored semi-bandit feedback. Our main result shows that in diminishing-returns regimes, the regret of this algorithm scales polylogarithmically with the horizon $T$ without any ad hoc tuning. For general nondecreasing curves, we prove that the same algorithm (with the same tuning) achieves a worst-case regret upper bound of $\tilde O(K\sqrt{T})$. Finally, we establish a matching worst-case regret lower bound of $Ω(K\sqrt{T})$ that holds even for full-feedback algorithms, highlighting the intrinsic hardness of our problem outside diminishing returns.
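To make the feedback model concrete, one round of the interaction can be sketched as below. This is only an illustrative simulation, not the paper's algorithm. It assumes a unit total budget (so allocations lie on the probability simplex), rewards drawn independently of success, and an objective equal to the sum of rewards from completed tasks. The names `censored_feedback` and `expected_reward`, the exponential curves, and the Bernoulli rewards are all hypothetical choices made for the example.

```python
import math
import random


def censored_feedback(allocation, curves, reward_samplers, rng):
    """Simulate one round and return (successes, observed_rewards).

    allocation: nonnegative budgets summing to 1 (assumed unit budget).
    curves: nondecreasing functions F_k mapping a budget to a success probability.
    reward_samplers: functions drawing one reward for task k (hypothetical).
    """
    if any(x < 0 for x in allocation) or not math.isclose(sum(allocation), 1.0):
        raise ValueError("allocation must lie on the probability simplex")
    successes, observed = [], []
    for x, F, sample in zip(allocation, curves, reward_samplers):
        ok = rng.random() < F(x)  # task k succeeds with probability F_k(x_k)
        successes.append(ok)
        # Censoring: the reward is revealed only when the task succeeds.
        observed.append(sample(rng) if ok else None)
    return successes, observed


def expected_reward(allocation, curves, means):
    """Expected total reward sum_k mu_k * F_k(x_k), under the assumptions above."""
    return sum(mu * F(x) for x, F, mu in zip(allocation, curves, means))


if __name__ == "__main__":
    rng = random.Random(0)
    means = [0.9, 0.5, 0.2]  # hypothetical mean rewards mu_k
    # Hypothetical concave (diminishing-returns) curves F_k(x) = 1 - exp(-c_k x).
    curves = [lambda x, c=c: 1.0 - math.exp(-c * x) for c in (1.0, 3.0, 5.0)]
    # Hypothetical Bernoulli rewards with the means above.
    samplers = [lambda r, m=m: float(r.random() < m) for m in means]
    allocation = [0.5, 0.3, 0.2]
    print(censored_feedback(allocation, curves, samplers, rng))
    print(expected_reward(allocation, curves, means))
```

In this sketch a failed task returns `None` rather than a reward of zero, so an estimate of $μ_k$ can only be updated in rounds where task $k$ succeeds. How often that happens depends on the budget allocated to it.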
