A New Benchmark for Online Learning with Budget-Balancing Constraints

Mark Braverman, Jingyi Liu, Jieming Mao, Jon Schneider, Eric Xue

TL;DR

The paper studies adversarial Bandits with Knapsacks (BwK) with a single resource and introduces a benchmark based on the Earth Mover's Distance (EMD). It proposes LagrangianEMD, a primal-dual algorithm that combines EXP4-IX with online gradient descent to achieve sublinear regret against the benchmark $\textsc{Opt}_{D,F}$. Under a bounded reward-to-cost ratio, the regret scales as $O(\sqrt{T |A| \log|F|} \cdot \log(1/\delta) + \sqrt{D})$. For the special case of pacing over windows of size $w$, the algorithm attains regret $\tilde{O}(T/\sqrt{w} + \sqrt{wT})$, and a matching lower bound shows this rate is optimal. The paper also gives evidence that the EMD condition is necessary for sublinear regret. Together, the results offer a flexible, practically motivated benchmark and algorithm for budgeted learning with time-varying spending, with autobidding as the main motivating application.
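As a quick sanity check on the windowed rate, the two terms can be balanced by hand. This is a back-of-the-envelope calculation that ignores the logarithmic factors hidden by $\tilde{O}$; it is not a statement taken from the paper.

$$
\frac{T}{\sqrt{w}} = \sqrt{wT}
\;\Longleftrightarrow\;
w = \sqrt{T},
\qquad\text{and then}\qquad
\frac{T}{\sqrt{w}} + \sqrt{wT} = 2\,T^{3/4}.
$$

For $w < \sqrt{T}$ the first term dominates and for $w > \sqrt{T}$ the second does. The expression is $o(T)$ exactly when $w \to \infty$ and $w = o(T)$, so the bound is only informative for window sizes strictly between constant and linear in $T$.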

Abstract

The adversarial Bandits with Knapsacks (BwK) problem is a multi-armed bandit problem with budget constraints and adversarial rewards and costs. In each round, a learner selects an action to take and observes the reward and cost of the selected action. The goal is to maximize the sum of rewards while satisfying the budget constraint. The classical benchmark to compare against is the best fixed distribution over actions that satisfies the budget constraint in expectation. Unlike its stochastic counterpart, where rewards and costs are drawn from some fixed distribution (Badanidiyuru et al., 2018), the adversarial BwK problem does not admit a no-regret algorithm for every problem instance due to the "spend-or-save" dilemma (Immorlica et al., 2022). A key problem left open by existing works is whether there exists a weaker but still meaningful benchmark to compare against such that no-regret learning is still possible. In this work, we present a new benchmark to compare against, motivated both by real-world applications such as autobidding and by its underlying mathematical structure. The benchmark is based on the Earth Mover's Distance (EMD), and we show that sublinear regret is attainable against any strategy whose spending pattern is within EMD $o(T^2)$ of any sub-pacing spending pattern. As a special case, we obtain results against the "pacing over windows" benchmark, where we partition time into disjoint windows of size $w$ and allow the benchmark strategies to choose a different distribution over actions for each window while satisfying a pacing budget constraint. Against this benchmark, our algorithm obtains a regret bound of $\tilde{O}(T/\sqrt{w}+\sqrt{wT})$. We also show a matching lower bound, proving the optimality of our algorithm in this important special case. In addition, we provide further evidence of the necessity of the EMD condition for obtaining sublinear regret.
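To make "EMD between spending patterns" concrete, here is a minimal Python sketch. It assumes a spending pattern is a list of per-round spends, that moving one unit of spend by one round costs 1, and that both patterns have the same total spend. Under those assumptions the one-dimensional EMD equals the sum of absolute differences of the prefix sums. The function names `spending_emd` and `pacing_pattern` are illustrative, and the paper's formal definition (in particular for sub-pacing patterns with unequal totals) may differ.

```python
from itertools import accumulate


def spending_emd(p, q, tol=1e-9):
    """1-D Earth Mover's Distance between two spending patterns.

    Assumes equal horizon and equal total spend; with unit cost per
    round of displacement, EMD is the L1 distance between prefix sums.
    """
    if len(p) != len(q):
        raise ValueError("patterns must cover the same number of rounds")
    if abs(sum(p) - sum(q)) > tol:
        raise ValueError("this sketch only handles equal total spend")
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


def pacing_pattern(budget, horizon):
    """Uniform pacing: spend budget / horizon in every round."""
    return [budget / horizon] * horizon


if __name__ == "__main__":
    T, B = 4, 4
    pacing = pacing_pattern(B, T)                # [1.0, 1.0, 1.0, 1.0]
    front_loaded = [4, 0, 0, 0]                  # spend everything at once
    windowed = [2, 0, 2, 0]                      # pace over windows of size 2
    print(spending_emd(front_loaded, pacing))    # far from pacing
    print(spending_emd(windowed, pacing))        # much closer to pacing
```

In this toy instance the front-loaded pattern is three times farther from uniform pacing than the windowed one, which matches the intuition that pacing within short windows stays close to pacing overall.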

Paper Structure

This paper contains 17 sections, 9 theorems, 54 equations, 2 algorithms.

Key Result

Theorem 4.1

Let $D \geq 0$ and $F \subseteq \Delta(A)^T$. If there exists a scalar $\alpha \geq 0$ such that $r_t(a) \leq \alpha \cdot c_t(a)$ for all time steps $t$ and actions $a$, then the regret against $\textsc{Opt}_{D,F}$ of choosing actions according to $\texttt{LagrangianEMD}(D, F)$ with $\overline{\lam

Theorems & Definitions (18)

  • Theorem 4.1
  • proof
  • Corollary 4.2
  • proof
  • Lemma 5.1
  • proof
  • Theorem 5.2
  • proof
  • Theorem 5.3
  • proof
  • ...and 8 more