Safe Online Convex Optimization with Multi-Point Feedback
Spencer Hutchinson, Mahnoosh Alizadeh
TL;DR
This work studies safe online convex optimization with an unknown constraint under zero-order, multi-point feedback. It introduces MP-ROGD, a projection-free algorithm that uses forward-difference gradient estimation and optimistic/pessimistic action sets to ensure zero constraint violation while achieving $\mathcal{O}(d \sqrt{T})$ regret when the constraint function is smooth and strongly convex. The analysis provides gradient-estimation error bounds, set-containment properties, and a regret bound, supported by a proof sketch and auxiliary lemmas. Numerical experiments compare MP-ROGD to baselines with full constraint information and first-order feedback, illustrating the trade-offs between constraint knowledge and zero-order information and the method's relevance to safe learning and control under bandit feedback.
Abstract
Motivated by the stringent safety requirements that are often present in real-world applications, we study a safe online convex optimization setting where the player needs to simultaneously achieve sublinear regret and zero constraint violation while only using zero-order information. In particular, we consider a multi-point feedback setting, where the player chooses $d + 1$ points in each round (where $d$ is the problem dimension) and then receives the value of the constraint function and cost function at each of these points. To address this problem, we propose an algorithm that leverages forward-difference gradient estimation as well as optimistic and pessimistic action sets to achieve $\mathcal{O}(d \sqrt{T})$ regret and zero constraint violation under the assumption that the constraint function is smooth and strongly convex. We then perform a numerical study to investigate the impacts of the unknown constraint and zero-order feedback on empirical performance.
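To make the multi-point feedback model concrete, the following is a minimal sketch of a forward-difference gradient estimator that uses exactly $d + 1$ function evaluations per round: one at the base point and one along each coordinate direction. It is an illustration under stated assumptions, not the algorithm proposed in this work: the function name `forward_difference_gradient`, the step size `delta`, and the quadratic test function are hypothetical choices, and the optimistic and pessimistic action sets are not modeled here.

```python
import numpy as np


def forward_difference_gradient(f, x, delta=1e-6):
    """Estimate the gradient of f at x from d + 1 zero-order queries.

    Queries f at x and at x + delta * e_i for each coordinate i.
    The name and the default step size are illustrative assumptions.
    """
    x = np.asarray(x, dtype=float)
    d = x.size
    f0 = f(x)  # single query at the base point
    grad = np.empty(d)
    for i in range(d):
        e_i = np.zeros(d)
        e_i[i] = 1.0  # i-th standard basis vector
        grad[i] = (f(x + delta * e_i) - f0) / delta
    return grad


def quadratic(x):
    """Hypothetical smooth, strongly convex test function ||x||^2."""
    return float(np.dot(x, x))


x0 = np.array([1.0, -2.0, 0.5])
g_hat = forward_difference_gradient(quadratic, x0)
true_grad = 2.0 * x0  # exact gradient of ||x||^2
max_error = float(np.max(np.abs(g_hat - true_grad)))
```

For a smooth function, the per-coordinate error of such an estimate scales linearly with `delta`, so a smaller step gives a more accurate gradient, up to floating-point rounding in this sketch.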
