Conservative Contextual Bandits: Beyond Linear Representations

Rohan Deb, Mohammad Ghavamzadeh, Arindam Banerjee

TL;DR

This work extends Conservative Contextual Bandits to general non-linear cost functions by combining Inverse Gap Weighting (IGW) exploration with an online regression oracle. It presents two algorithms: C-SquareCB, whose regret is sublinear in the horizon $T$, and C-FastCB, whose first-order regret is sublinear in $L^*$, the cumulative loss of the optimal policy. With a neural network as the function approximator and online gradient descent as the regression oracle, the regret bounds are $\tilde{O}(\sqrt{KT} + K/\alpha)$ and $\tilde{O}(\sqrt{KL^*} + K(1 + 1/\alpha))$, respectively. The safety constraint relative to a known baseline holds with high probability, and experiments on real-world data show that the algorithms outperform the existing baseline while preserving that constraint. The neural extensions leverage NTK-style assumptions and ensemble perturbations to provide end-to-end guarantees.

Abstract

Conservative Contextual Bandits (CCBs) address safety in sequential decision making by requiring that an agent's policy, in addition to minimizing regret, satisfies a safety constraint: its performance is not worse than that of a baseline policy (e.g., the policy that the company has in production) by more than a $(1+\alpha)$ factor. Prior work developed UCB-style algorithms in the multi-armed [Wu et al., 2016] and contextual linear [Kazerouni et al., 2017] settings. However, in practice the cost of the arms is often a non-linear function, and therefore existing UCB algorithms are ineffective in such settings. In this paper, we consider CCBs beyond the linear case and develop two algorithms, $\mathtt{C\text{-}SquareCB}$ and $\mathtt{C\text{-}FastCB}$, using Inverse Gap Weighting (IGW) based exploration and an online regression oracle. We show that the safety constraint is satisfied with high probability and that the regret of $\mathtt{C\text{-}SquareCB}$ is sub-linear in the horizon $T$, while the regret of $\mathtt{C\text{-}FastCB}$ is first-order and sub-linear in $L^*$, the cumulative loss of the optimal policy. Subsequently, we use a neural network for function approximation and online gradient descent as the regression oracle to provide $\tilde{O}(\sqrt{KT} + K/\alpha)$ and $\tilde{O}(\sqrt{KL^*} + K(1 + 1/\alpha))$ regret bounds, respectively. Finally, we demonstrate the efficacy of our algorithms on real-world data and show that they significantly outperform the existing baseline while maintaining the performance guarantee.
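
To make the IGW exploration step concrete, here is a minimal sketch of the standard Inverse Gap Weighting rule as used in SquareCB-style algorithms, written for costs (lower is better). It is not taken from the paper: the function name `igw_distribution`, the parameter `gamma` (the exploration rate), and the choice of a fixed `gamma` are assumptions for illustration, and the conservative variants in the paper additionally mix in the baseline action, which this sketch omits.

```python
from typing import List


def igw_distribution(predicted_costs: List[float], gamma: float) -> List[float]:
    """Standard Inverse Gap Weighting over K arms (hypothetical helper).

    predicted_costs: the regression oracle's cost estimate for each arm.
    gamma: exploration parameter; larger values put more mass on the greedy arm.
    """
    if not predicted_costs:
        raise ValueError("need at least one arm")
    if gamma <= 0:
        raise ValueError("gamma must be positive")

    k = len(predicted_costs)
    # Greedy arm: the one with the smallest predicted cost.
    best = min(range(k), key=lambda a: predicted_costs[a])

    probs = [0.0] * k
    for a in range(k):
        if a != best:
            # Probability shrinks as the predicted gap to the greedy arm grows.
            gap = predicted_costs[a] - predicted_costs[best]
            probs[a] = 1.0 / (k + gamma * gap)
    # The greedy arm receives all remaining probability mass.
    probs[best] = 1.0 - sum(probs)
    return probs


if __name__ == "__main__":
    # Approximately [0.733, 0.167, 0.100] for these inputs.
    print(igw_distribution([0.2, 0.5, 0.9], gamma=10.0))
```

Each non-greedy arm gets probability at most $1/K$, so the greedy arm always keeps non-negative mass; when all predicted costs are equal the rule reduces to the uniform distribution.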

Paper Structure

This paper contains 11 sections, 15 theorems, 142 equations, 2 figures, 2 algorithms.

Key Result

Theorem 3.1

Suppose the realizability, gap-bound, and online regression (square loss) assumptions hold. With probability at least $1-\delta$, $\mathtt{C\text{-}SquareCB}$ satisfies the performance constraint (Definition 2.2) and has the following regret bound:

Figures (2)

  • Figure 1: Comparison of the cumulative regret of $\mathtt{C\text{-}SquareCB}$ and $\mathtt{C\text{-}FastCB}$ with the baseline $\mathtt{C\text{-}LinUCB}$ on OpenML datasets (averaged over 10 runs).
  • Figure 2: Comparison of the percentage of constraints violated by $\mathtt{C\text{-}SquareCB}$ and $\mathtt{C\text{-}FastCB}$ with their vanilla non-conservative versions on OpenML datasets (averaged over 100 runs).
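
A metric like the one in Figure 2 could be computed along the following lines. This is a sketch under stated assumptions, not the paper's evaluation code: it assumes the performance constraint takes the cumulative form common in the conservative bandit literature (the agent's cumulative cost up to round $t$ is at most $(1+\alpha)$ times the baseline's cumulative cost), and that the percentage is taken over rounds. The function names and the example numbers are hypothetical.

```python
from itertools import accumulate
from typing import List, Sequence


def constraint_violations(
    agent_costs: Sequence[float],
    baseline_costs: Sequence[float],
    alpha: float,
) -> List[bool]:
    """Flag each round t where the assumed cumulative constraint fails.

    Assumed constraint: sum of agent costs up to t <= (1 + alpha) * sum of
    baseline costs up to t.
    """
    if len(agent_costs) != len(baseline_costs):
        raise ValueError("cost sequences must have the same length")
    agent_cum = accumulate(agent_costs)
    baseline_cum = accumulate(baseline_costs)
    return [a > (1.0 + alpha) * b for a, b in zip(agent_cum, baseline_cum)]


def violation_percentage(
    agent_costs: Sequence[float],
    baseline_costs: Sequence[float],
    alpha: float,
) -> float:
    """Percentage of rounds in which the assumed constraint is violated."""
    flags = constraint_violations(agent_costs, baseline_costs, alpha)
    if not flags:
        return 0.0
    return 100.0 * sum(flags) / len(flags)


if __name__ == "__main__":
    # Illustrative costs only: an expensive first round, then cheap rounds.
    agent = [0.9, 0.1, 0.1, 0.1]
    baseline = [0.5, 0.5, 0.5, 0.5]
    print(violation_percentage(agent, baseline, alpha=0.2))
```

A non-conservative learner that explores aggressively in early rounds tends to trip this check before its cumulative cost recovers, which is the behavior a conservative variant is designed to avoid.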

Theorems & Definitions (37)

  • Definition 2.1: Regret
  • Definition 2.2: Performance Constraint
  • Theorem 3.1: Regret Bound for C-SquareCB
  • Remark 3.1: Term interpretations
  • Remark 3.2: Infinite actions
  • Proof: Proof of Theorem 3.1
  • Remark 3.3: Bounding baseline regret
  • Remark 3.4: Time dependent Exploration
  • Theorem 4.1: Regret Bound for C-FastCB
  • Remark 4.1: First Order Regret
  • ...and 27 more