Learning to Allocate Resources with Censored Feedback

Giovanni Montanari; Côme Fiegel; Corentin Pla; Aadirupa Saha; Vianney Perchet

TL;DR

The paper addresses online resource allocation across $K$ arms under censored feedback, where a reward requires both arm activation with probability $p_i$ and budget surpassing a latent threshold $X_{t,i} \sim G(\lambda_i)$. It introduces RA-UCB, an optimistic, batched-estimation algorithm that decouples reward collection from parameter estimation, achieving $\tilde{O}(\sqrt{T})$ regret (and poly-log improvements under stronger assumptions) in the known-budget setting, and proves a fundamental $\Omega(T^{1/3})$ lower bound. To handle unknown per-round budgets, the paper extends to MG-UCB, which uses within-round switching and a water-filling procedure, preserving the same regret guarantees. The approach is validated on real-world datasets (EdNet and Criteo-derived benchmarks), demonstrating practical effectiveness for online advertising and adaptive education scenarios. Overall, the work advances theoretical understanding of censored-threshold bandits and delivers practical, near-optimal algorithms for online resource allocation with incomplete feedback.
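As a hedged illustration of the model summarized above: assume (this is not stated in the summary) that activation and threshold are independent, that a success pays a unit reward, and that $F_{\lambda_i}$ denotes the CDF of $G(\lambda_i)$. Under those assumptions, the expected reward of allocating $b_i$ to arm $i$, and the per-round benchmark that a learner would compete against, could be written as:

$$
\mathbb{E}\left[r_{t,i} \mid b_i\right] = p_i \, F_{\lambda_i}(b_i),
\qquad
\max_{b \in \mathbb{R}_+^K,\ \sum_{i=1}^K b_i \le B} \ \sum_{i=1}^K p_i \, F_{\lambda_i}(b_i).
$$

The symbols $r_{t,i}$, $b_i$, and $F_{\lambda_i}$ are notation introduced here for illustration and may differ from the paper's own.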

Abstract

We study the online resource allocation problem in which, at each round, a budget $B$ must be allocated across $K$ arms under censored feedback. An arm yields a reward if and only if two conditions are satisfied: (i) the arm is activated according to an arm-specific Bernoulli random variable with unknown parameter, and (ii) the allocated budget exceeds a random threshold drawn from a parametric distribution with unknown parameter. Over $T$ rounds, the learner must jointly estimate the unknown parameters and allocate the budget so as to maximize cumulative reward while facing the exploration-exploitation trade-off. We prove an information-theoretic regret lower bound of $\Omega(T^{1/3})$, demonstrating the intrinsic difficulty of the problem. We then propose RA-UCB, an optimistic algorithm that leverages non-trivial parameter estimation and confidence bounds. When the budget $B$ is known at the beginning of each round, RA-UCB achieves a regret of order $\widetilde{\mathcal{O}}(\sqrt{T})$, and even $\mathcal{O}(\mathrm{poly}\text{-}\log T)$ under stronger assumptions. For unknown, round-dependent budgets, we introduce MG-UCB, which allows within-round switching and infinitesimal allocations, and matches the regret guarantees of RA-UCB. We then validate our theoretical results through experiments on real-world datasets.
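The two-condition feedback model described in the abstract can be sketched as a small simulation. This is a minimal illustration, not the paper's code: it assumes exponential thresholds with rate `lam[i]` (the abstract only says "a parametric distribution"), independent activation and threshold draws, and a unit reward on success. The function names `simulate_round`, `success_probability`, and `empirical_success_rates` are hypothetical.

```python
import math
import random


def simulate_round(allocation, p, lam, rng):
    """Return the censored (binary) feedback for one round.

    Assumptions for illustration only: thresholds are exponential with
    rate lam[i], activation and threshold are independent, and a success
    pays reward 1. The learner sees only the 0/1 outcome, never the
    activation indicator or the threshold itself.
    """
    feedback = []
    for b_i, p_i, lam_i in zip(allocation, p, lam):
        activated = rng.random() < p_i          # Bernoulli(p_i) activation
        threshold = rng.expovariate(lam_i)      # latent threshold draw
        feedback.append(1 if (activated and b_i > threshold) else 0)
    return feedback


def success_probability(b_i, p_i, lam_i):
    """Closed-form success probability under the exponential assumption."""
    return p_i * (1.0 - math.exp(-lam_i * b_i))


def empirical_success_rates(allocation, p, lam, rounds=20000, seed=0):
    """Average the binary feedback over many rounds with a fixed allocation."""
    rng = random.Random(seed)
    totals = [0] * len(allocation)
    for _ in range(rounds):
        for i, r in enumerate(simulate_round(allocation, p, lam, rng)):
            totals[i] += r
    return [t / rounds for t in totals]


if __name__ == "__main__":
    allocation = [1.0, 2.0, 0.0]   # sums to a budget B = 3
    p = [0.8, 0.5, 0.9]
    lam = [1.0, 0.5, 2.0]
    rates = empirical_success_rates(allocation, p, lam)
    for i, rate in enumerate(rates):
        exact = success_probability(allocation[i], p[i], lam[i])
        print(f"arm {i}: empirical={rate:.3f} closed-form={exact:.3f}")
```

The sketch makes the censoring visible: a failure on arm $i$ does not reveal whether the arm was inactive or the allocation fell short of the threshold, which is why $p_i$ and $\lambda_i$ have to be estimated jointly from binary outcomes.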

Paper Structure (49 sections, 18 theorems, 296 equations, 7 figures, 2 tables, 4 algorithms)

Introduction
Related Works
Contributions
The Known Budget Model
Regret Lower Bound
RA-UCB Algorithm
Estimation of $\lambda_i$
Estimation of $p_i$
Algorithm
Regret Upper Bounds
The Unknown Budget Model
Experiments
Conclusion and Future Directions
Experiments
Simulations and comparison with other algorithms
...and 34 more sections

Key Result

Theorem 3.1

Assume that $K\leq T$. Then, no algorithm guarantees for any parameter an expected regret of

Figures (7)

Figure 1: EdNet-KT3 quiz benchmark. Confidence intervals are computed over $5$ independent runs using $5$ batches of $1{,}000$ different users. $B = 700\,\mathrm{s}$, $K = 20$, $T = 1{,}000$.
Figure 2: Log-scale comparison of empirical and theoretical regret bounds for RA-UCB.
Figure 3: Comparison between RA-UCB (blue) and an Explore-Then-Commit baseline (RA-ETC) for $T=10{,}000$.
Figure 4: Comparison between RA-UCB (blue) and a naive baseline without confidence bounds (NO UCB).
Figure 5: Response-time distribution analysis for a representative EdNet-KT3 question (q4135). Empirical PDF (left), CDF (middle), and Q–Q plot (right) with fitted truncated Weibull model. Response times are fully observed, enabling direct validation of the Weibull threshold assumption.
...and 2 more figures

Theorems & Definitions (31)

Theorem 3.1
Lemma 4.2
Lemma 4.3
Lemma 4.6
Remark 4.7
Theorem 4.8
Theorem 4.11
Lemma 5.2
Remark 5.3: Implementability
Lemma B.1
...and 21 more
