Safe Linear Bandits over Unknown Polytopes

Aditya Gangrade, Tianrui Chen, Venkatesh Saligrama

TL;DR

This work introduces Safe Linear Bandits over Unknown Polytopes (SLB) where both the objective and the constraints are unknown and must be learned online from noisy feedback. It proves a fundamental hardness barrier showing that simultaneous polylogarithmic bounds on efficacy and safety cannot be achieved in general, then presents a doubly-optimistic algorithm (DOSS) that attains near-optimal trade-offs: $E_T = O(\text{polylog}(T))$ while $S_T = \tilde{O}(\sqrt{T})$, with polylog dependence on unknown constraint counts and problem dimensions. Central to the analysis is a dual LP perspective: extreme points are recast as saturating $d$ constraints, and DOSS activates noisy versions of $d$ constraints to manage primal performance and constraint-satisfaction costs separately via new dual gaps. The paper also discusses finite-precision relaxations (tolerances on constraint levels and grid-precision for constraint parameters) that can significantly improve safety performance, and provides simulations demonstrating strong efficacy control and controlled safety violations. Overall, the results illuminate the fundamental limits of safe learning with unknown polytopes and offer a practical, theoretically grounded method with strong performance guarantees and broad applicability to safety-constrained online decision-making.

Abstract

The safe linear bandit problem (SLB) is an online approach to linear programming with unknown objective and unknown roundwise constraints, under stochastic bandit feedback of rewards and safety risks of actions. We study the tradeoffs between efficacy and smooth safety costs of SLBs over polytopes, and the role of aggressive doubly-optimistic play in avoiding the strong assumptions made by extant pessimistic-optimistic approaches. We first elucidate an inherent hardness in SLBs due to the lack of knowledge of constraints: there exist 'easy' instances, for which suboptimal extreme points have large 'gaps', but on which SLB methods must still incur $\Omega(\sqrt{T})$ regret or safety violations, due to an inability to resolve unknown optima to arbitrary precision. We then analyse a natural doubly-optimistic strategy for the safe linear bandit problem, DOSS, which uses optimistic estimates of both reward and safety risks to select actions, and show that despite the lack of knowledge of constraints or feasible points, DOSS simultaneously obtains tight instance-dependent $O(\log^2 T)$ bounds on efficacy regret, and $\tilde O(\sqrt{T})$ bounds on safety violations. Further, when safety is demanded to a finite precision, violations improve to $O(\log^2 T)$. These results rely on a novel dual analysis of linear bandits: we argue that DOSS proceeds by activating noisy versions of at least $d$ constraints in each round, which allows us to separately analyse rounds where a 'poor' set of constraints is activated, and rounds where 'good' sets of constraints are activated. The costs in the former are controlled to $O(\log^2 T)$ by developing new dual notions of gaps, based on global sensitivity analyses of linear programs, that quantify the suboptimality of each such set of constraints. The latter costs are controlled to $O(1)$ by explicitly analysing the solutions of optimistic play.
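To make "optimistic estimates of both reward and safety risks" concrete, here is a minimal Python sketch of a doubly-optimistic selection rule. It is not the paper's DOSS implementation: it assumes a finite candidate set instead of a polytope, a single shared confidence width of the form `beta * ||x||_{V^{-1}}`, and hypothetical names (`doubly_optimistic_choice`, `theta_hat`, `A_hat`, `alpha`, `V`, `beta`) chosen only for illustration.

```python
import numpy as np

def doubly_optimistic_choice(candidates, theta_hat, A_hat, alpha, V, beta):
    """Return the index of the candidate with the largest optimistic reward
    among candidates that are feasible under optimistic constraint estimates.

    Illustrative assumptions (not taken from the paper):
      candidates: (n, d) array of actions, theta_hat: (d,) reward estimate,
      A_hat: (m, d) constraint estimate, alpha: (m,) constraint levels,
      V: (d, d) regularised design matrix, beta: confidence radius.
    """
    V_inv = np.linalg.inv(V)
    # Confidence width at each candidate: beta * ||x||_{V^{-1}}.
    quad = np.einsum("nd,de,ne->n", candidates, V_inv, candidates)
    widths = beta * np.sqrt(quad)
    # Optimism on reward: upper confidence bound on theta . x.
    reward_ucb = candidates @ theta_hat + widths
    # Optimism on safety: lower confidence bound on each row of A x.
    risk_lcb = candidates @ A_hat.T - widths[:, None]
    plausible = np.all(risk_lcb <= alpha, axis=1)
    if not plausible.any():
        raise ValueError("no candidate is plausibly safe")
    scores = np.where(plausible, reward_ucb, -np.inf)
    return int(np.argmax(scores))

candidates = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
theta_hat = np.array([1.0, 0.5])
A_hat = np.array([[1.0, 1.0]])   # estimated constraint: x1 + x2 <= 1
alpha = np.array([1.0])

# Little data (V = I): wide intervals, so the risky corner looks plausibly safe.
early = doubly_optimistic_choice(candidates, theta_hat, A_hat, alpha, np.eye(2), 1.0)
# Much data (V = 100 I): narrow intervals, so the risky corner is ruled out.
late = doubly_optimistic_choice(candidates, theta_hat, A_hat, alpha, 100 * np.eye(2), 1.0)
```

In this toy instance the rule picks the corner `[1, 1]` while uncertainty is large and falls back to `[1, 0]` once the estimates tighten. This mirrors the trade-off the abstract describes: aggressive play can incur safety violations early on, in exchange for not needing a known feasible point.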

Paper Structure (69 sections, 22 theorems, 115 equations, 7 figures, 1 table, 1 algorithm)


Key Result

Lemma 3.2

The confidence sets are consistent, i.e., $\mathbb{P}\left( \forall t, \theta \in \mathcal{C}_t^\theta(\delta), A \in \boldsymbol{\mathcal{C}}_t(\delta) \right) \ge 1-\delta.$ Further, under consistency, the noise scale $\rho_t(x;\delta)$ at any $x \in \mathcal{X}$ satisfies $\forall x \in \mathcal{X}$ ... Finally, for any sequence $\{x_t\}$, $\sum_{s\le t} \rho_s(x_s)^2 = O(d^2 \log^2 t)$ and $\sum_{s}$ ...
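Bounds of this type on summed squared noise scales are usually derived from the elliptical potential lemma. The sketch below checks that lemma numerically under assumptions the excerpt does not confirm: that the noise scale is proportional to $\|x\|_{V_t^{-1}}$ for a regularised design matrix $V_t$, that actions have norm at most 1, and that the regulariser is the identity. The function name `elliptical_potential` is hypothetical.

```python
import numpy as np

def elliptical_potential(actions, reg=1.0):
    """Sum of min(1, ||x_s||^2 in the V_{s-1}^{-1} norm) over a sequence of
    actions, where V_0 = reg * I and V_s = V_{s-1} + x_s x_s^T."""
    d = actions.shape[1]
    V = reg * np.eye(d)
    total = 0.0
    for x in actions:
        # Squared width of the confidence interval in direction x.
        total += min(1.0, float(x @ np.linalg.solve(V, x)))
        V += np.outer(x, x)
    return total, V

rng = np.random.default_rng(0)
d, T = 5, 2000
actions = rng.normal(size=(T, d))
# Rescale so that every action has Euclidean norm at most 1.
actions /= np.maximum(1.0, np.linalg.norm(actions, axis=1, keepdims=True))

total, V = elliptical_potential(actions)
log_det = np.linalg.slogdet(V)[1]      # log det V_T (det V_0 = 1 here)
bound = 2 * d * np.log(1 + T / d)      # standard 2 d log(1 + T/d) bound
```

The summed squared widths grow only like $d \log T$. If the confidence radius multiplying them itself scales like $d \log t$ (again an assumption about the form of $\rho_t$), this is consistent with the $O(d^2 \log^2 t)$ rate stated in the lemma.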

Figures (7)

  • Figure 1: The challenge, and our approach. Left: The usual primal view of linear bandits over polytopes breaks down, since noisy estimates of the unknown $A$ induce a continuum of potential locations for extreme points (red blobs). Right: Taking a dual linear programming view, we can identify extreme points as arising by saturating $d$ independent constraints. We generalise this view by showing that DOSS plays by saturating noisy versions of $d$ constraints. Poor play can arise from picking the wrong set of constraints (blue), or using a poor estimate for the right set of constraints (red).
  • Figure 2: An obstruction to logarithmic bounds in safe linear bandits.
  • Figure 3: Illustration of Ex. \ref{example:triangle}. The black lines represent the known constraints, the red line is the unknown constraint, and the blue line is the locus of optimality.
  • Figure 4: Illustration of gaps in Ex. \ref{example:triangle}. $x^{I_1}$ is the purple dot, and the activation polytope $\mathcal{T}(\rho_t;I_1)$ is shown in purple, along with the separation $\gamma(I_1)$. The spread $\mathfrak{s}(I_1)$ is the inner product of the direction in which $\mathcal{T}$ varies and $\theta$. For $I_4,$ the feasibility gap $\zeta_*(I)$ is illustrated geometrically in orange.
  • Figure 5: Efficacy Regret and Safety Violation of DOSS. We plot averages and one standard deviation confidence regions over 30 runs for $\mathscr{E}_T$ (left) and both $\mathscr{S}_t$ and $\mathscr{S}_t^{0.05}$ (right). We also plot the upper bounds we show in the latter to contextualise the observations. Observe that the efficacy regret is marginal: the mean is essentially $0,$ and the variance is limited. Further, observe that the growth of the net safety violation $\mathscr{S}_t$ is well-controlled, and lies far below the bounds of §\ref{section:regret}. Further, the finite precision violations show a strong flattening, as is expected from Theorem \ref{thm:main_regret_bound}.
  • ...and 2 more figures

Theorems & Definitions (50)

  • Definition 3.1
  • Lemma 3.2
  • Theorem 4.1
  • Theorem 5.1
  • Definition 6.1
  • Definition 6.2
  • Example 6.3
  • Definition 6.4
  • Proposition 6.5
  • Lemma 6.6
  • ...and 40 more