Conservative Contextual Bandits: Beyond Linear Representations
Rohan Deb, Mohammad Ghavamzadeh, Arindam Banerjee
TL;DR
This work extends Conservative Contextual Bandits to general non-linear cost functions by combining Inverse Gap Weighting (IGW) exploration with an online regression oracle. It presents two algorithms: C-SquareCB, whose regret is sublinear in the horizon $T$, and C-FastCB, whose first-order regret is sublinear in $L^*$, the cumulative loss of the optimal policy. Neural-network implementations, with online gradient descent as the regression oracle, achieve $\tilde{O}(\sqrt{KT} + K/α)$ and $\tilde{O}(\sqrt{KL^*} + K(1 + 1/α))$ regret bounds, respectively. The framework satisfies the safety constraint relative to a baseline policy with high probability and significantly outperforms the existing baseline on real-world data while preserving that constraint. The neural extensions leverage NTK-style assumptions and ensemble perturbations to provide end-to-end guarantees, making the approach viable for non-linear decision problems under safety constraints.
Abstract
Conservative Contextual Bandits (CCBs) address safety in sequential decision making by requiring that an agent's policy, in addition to minimizing regret, satisfies a safety constraint: its performance must not be worse than that of a baseline policy (e.g., the policy that the company has in production) by more than a $(1+α)$ factor. Prior work developed UCB-style algorithms in the multi-armed [Wu et al., 2016] and contextual linear [Kazerouni et al., 2017] settings. In practice, however, the cost of the arms is often a non-linear function, and existing UCB algorithms are ineffective in such settings. In this paper, we consider CCBs beyond the linear case and develop two algorithms, $\mathtt{C-SquareCB}$ and $\mathtt{C-FastCB}$, using Inverse Gap Weighting (IGW) based exploration and an online regression oracle. We show that the safety constraint is satisfied with high probability and that the regret of $\mathtt{C-SquareCB}$ is sub-linear in the horizon $T$, while the regret of $\mathtt{C-FastCB}$ is first-order and sub-linear in $L^*$, the cumulative loss of the optimal policy. Subsequently, we use a neural network for function approximation and online gradient descent as the regression oracle to provide $\tilde{O}(\sqrt{KT} + K/α)$ and $\tilde{O}(\sqrt{KL^*} + K (1 + 1/α))$ regret bounds, respectively. Finally, we demonstrate the efficacy of our algorithms on real-world data and show that they significantly outperform the existing baseline while maintaining the performance guarantee.
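To make the two ingredients named above concrete, the following is a minimal sketch, not the paper's algorithm. It assumes the standard IGW rule from the SquareCB literature (non-greedy arms get probability $1/(K + γ \cdot \text{gap})$, the greedy arm gets the remaining mass) and reads the safety constraint as a bound on cumulative cost relative to the baseline. The function names `igw_distribution` and `satisfies_safety`, the exploration parameter `gamma`, and the cumulative-cost form of the check are illustrative assumptions; the paper's exact variants may differ.

```python
def igw_distribution(predicted_costs, gamma):
    """Inverse Gap Weighting over K arms, given oracle-predicted costs.

    Assumed standard form: lower predicted cost is better, and `gamma`
    (hypothetical name) controls how sharply exploration decays with the gap.
    """
    costs = [float(c) for c in predicted_costs]
    k = len(costs)
    best = min(range(k), key=costs.__getitem__)  # greedy (lowest-cost) arm
    # Each non-greedy arm gets mass inversely proportional to its gap.
    probs = [1.0 / (k + gamma * (c - costs[best])) for c in costs]
    # The greedy arm receives whatever probability mass is left over.
    probs[best] = 0.0
    probs[best] = 1.0 - sum(probs)
    return probs


def satisfies_safety(cumulative_cost, cumulative_baseline_cost, alpha):
    """Assumed reading of the constraint: the agent's cumulative cost stays
    within a (1 + alpha) factor of the baseline policy's cumulative cost."""
    return cumulative_cost <= (1.0 + alpha) * cumulative_baseline_cost


if __name__ == "__main__":
    p = igw_distribution([0.2, 0.5, 0.9], gamma=10.0)
    print(p)  # the greedy arm (index 0) carries most of the mass
    print(satisfies_safety(10.5, 10.0, alpha=0.1))
```

Because every non-greedy arm receives at most $1/K$, the greedy arm always keeps at least $1/K$ of the mass, so the returned vector is a valid distribution for any `gamma >= 0`.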
