Non-stationary Bandit Convex Optimization: A Comprehensive Study
Xiaoqi Liu, Dorian Baudry, Julian Zimmert, Patrick Rebeschini, Arya Akhavan
TL;DR
The paper studies non-stationary Bandit Convex Optimization (BCO) in continuous action spaces under three measures of non-stationarity: the number of switches S, the total variation Δ, and the path-length P. It introduces TEWA-SE, a polynomial-time, zeroth-order sleeping-experts algorithm built on one-point gradient estimates. For strongly convex losses, TEWA-SE achieves adaptive interval regret guarantees and minimax-optimal dynamic regret bounds when S and Δ are known, and it extends to unknown non-stationarity via the Bandit-over-Bandit (BoB) framework. The paper also presents cExO, an exponential-weights method over a discretized action space with clipping. For general convex losses, cExO attains minimax-optimal regret with respect to known S and Δ and improves on the best existing bounds with respect to P, although it is not polynomial-time computable and has a higher dimension dependence; BoB-based variants handle unknown non-stationarity. The authors provide matching lower bounds, unify conversions among regret notions, and highlight the remaining open challenge of designing computationally efficient, minimax-optimal algorithms for general convex non-stationary BCO. Overall, the work advances a unified framework for non-stationary BCO, connecting OCO techniques with bandit feedback and setting the stage for future efficient second-order methods.
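To make the phrase "one-point gradient estimates" concrete, the sketch below shows the classical sphere-sampling estimator, which builds a gradient surrogate from a single loss evaluation per round. It is an illustration under stated assumptions, not necessarily the estimator used inside TEWA-SE: the function name `one_point_gradient`, the smoothing radius `h`, and the uniform-sphere perturbation are all choices made here for the example.

```python
import numpy as np


def one_point_gradient(loss, x, h, rng):
    """Classical one-point gradient estimate (illustrative, not the paper's exact estimator).

    loss: callable returning a scalar loss at a point (the only feedback available).
    x:    current point, shape (d,).
    h:    smoothing radius (hypothetical tuning parameter).
    rng:  numpy random Generator.
    """
    d = x.shape[0]
    # Draw a direction uniformly on the unit sphere.
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    # Single bandit query at the perturbed point.
    y = loss(x + h * u)
    # (d / h) * y * u is an unbiased estimate of the gradient of the
    # ball-smoothed version of the loss at x.
    return (d / h) * y * u
```

For a quadratic loss such as `lambda z: float(z @ z)`, averaging many such estimates at a fixed `x` approaches the true gradient `2 * x`, although each individual estimate has variance of order `(d / h) ** 2`.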
Abstract
Bandit Convex Optimization is a fundamental class of sequential decision-making problems, where the learner selects actions from a continuous domain and observes a loss (but not its gradient) at only one point per round. We study this problem in non-stationary environments, and aim to minimize the regret under three standard measures of non-stationarity: the number of switches $S$ in the comparator sequence, the total variation $\Delta$ of the loss functions, and the path-length $P$ of the comparator sequence. We propose a polynomial-time algorithm, Tilted Exponentially Weighted Average with Sleeping Experts (TEWA-SE), which adapts the sleeping experts framework from online convex optimization to the bandit setting. For strongly convex losses, we prove that TEWA-SE is minimax-optimal with respect to known $S$ and $\Delta$ by establishing matching upper and lower bounds. By equipping TEWA-SE with the Bandit-over-Bandit framework, we extend our analysis to environments with unknown non-stationarity measures. For general convex losses, we introduce a second algorithm, clipped Exploration by Optimization (cExO), based on exponential weights over a discretized action space. While not polynomial-time computable, this method achieves minimax-optimal regret with respect to known $S$ and $\Delta$, and improves on the best existing bounds with respect to $P$.
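The three non-stationarity measures can be illustrated on a toy sequence. The sketch below uses common textbook conventions, which are assumptions here and may differ from the paper's exact definitions: $S$ is taken as one plus the number of rounds where the comparator changes, $P$ as the sum of Euclidean distances between consecutive comparators, and $\Delta$ as the sum over rounds of the largest absolute change in the loss, with the supremum over the domain approximated by a maximum over a finite grid. All function names are hypothetical.

```python
import numpy as np


def num_switches(comparators):
    """S: 1 + number of rounds t where u_t differs from u_{t-1} (assumed convention)."""
    u = np.asarray(comparators, dtype=float)  # shape (T, d)
    changed = np.any(u[1:] != u[:-1], axis=1)
    return 1 + int(changed.sum())


def path_length(comparators):
    """P: sum of Euclidean distances between consecutive comparators."""
    u = np.asarray(comparators, dtype=float)  # shape (T, d)
    return float(np.linalg.norm(u[1:] - u[:-1], axis=1).sum())


def total_variation(losses, grid):
    """Delta: sum over t of max_x |f_t(x) - f_{t-1}(x)|.

    The supremum over the domain is approximated by a maximum over `grid`,
    so the result is a lower bound on the true total variation.
    """
    values = np.array([[f(x) for x in grid] for f in losses])  # shape (T, len(grid))
    return float(np.abs(values[1:] - values[:-1]).max(axis=1).sum())
```

For instance, the comparator sequence `[[0, 0], [0, 0], [1, 0], [1, 0], [1, 2]]` changes twice, so under these conventions `num_switches` gives 3 and `path_length` gives 3.0.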
