Second Order Methods for Bandit Optimization and Control

Arun Suggala, Y. Jennifer Sun, Praneeth Netrapalli, Elad Hazan

TL;DR

This work develops a practical second-order approach to bandit convex optimization by introducing Bandit Newton Step (BNS) for κ-convex losses, achieving 𝑂̃(d^{2.5}√T) regret with 𝑂(d^2) per-iteration cost and extending to specific losses like logistic regression with favorable dependence on the diameter. It then extends these ideas to online control via affine-memory reductions, introducing Bandit Quadratic Optimization with Affine Memory (BQO-AM) and the Newton Bandit Perturbation Controller (NBPC), which attain 𝑂̃(√T) control regret against fully adversarial disturbances under an affine-memory structure. The paper also proves a near-tight Ω̃(T^{2/3}) lower bound for BCO with memory (BCO-M), highlighting intrinsic hardness beyond standard BCO and clarifying the separation between bandit optimization with memory and bandit control. Collectively, these results offer a coherent framework for efficient second-order bandit methods in both online learning and control, while outlining key open questions such as proper-learning variants and extensions to unknown dynamics.

Abstract

Bandit convex optimization (BCO) is a general framework for online decision making under uncertainty. While tight regret bounds for general convex losses have been established, existing algorithms achieving these bounds have prohibitive computational costs for high-dimensional data. In this paper, we propose a simple and practical BCO algorithm inspired by the online Newton step algorithm. We show that our algorithm achieves optimal (in terms of horizon) regret bounds for a large class of convex functions that we call $\kappa$-convex. This class contains a wide range of practically relevant loss functions including linear, quadratic, and generalized linear models. In addition to optimal regret, this method is the most efficient known algorithm for several well-studied applications including bandit logistic regression. Furthermore, we investigate the adaptation of our second-order bandit algorithm to online convex optimization with memory. We show that for loss functions with a certain affine structure, the extended algorithm attains optimal regret. This leads to an algorithm with optimal regret for bandit LQR/LQG problems under a fully adversarial noise model, thereby resolving an open question posed in \citep{gradu2020non} and \citep{sun2023optimal}. Finally, we show that the more general problem of BCO with (non-affine) memory is harder. We derive a $\tilde{\Omega}(T^{2/3})$ regret lower bound, even under the assumption of smooth and quadratic losses.
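To make the "online Newton step" flavor and the 𝑂(d^2) per-iteration cost concrete, here is a minimal sketch of a Newton-style update driven by bandit (scalar-only) feedback. It is not the paper's BNS algorithm: the one-point gradient estimate, the rank-one preconditioner, the omission of any projection onto $\mathcal{K}$, and all names (`sherman_morrison`, `newton_style_bandit_step`, `eta`, `delta`) are illustrative assumptions. The only point it demonstrates is that maintaining the inverse of a matrix under rank-one updates costs 𝑂(d^2) per round rather than the 𝑂(d^3) of a fresh inversion.

```python
import math
import random


def mat_vec(M, v):
    """Multiply a d x d matrix (list of rows) by a length-d vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def sherman_morrison(A_inv, g):
    """Given a symmetric A^{-1}, return (A + g g^T)^{-1} in O(d^2) time."""
    Ag = mat_vec(A_inv, g)
    denom = 1.0 + sum(g_i * a_i for g_i, a_i in zip(g, Ag))
    d = len(g)
    return [[A_inv[i][j] - Ag[i] * Ag[j] / denom for j in range(d)]
            for i in range(d)]


def random_unit_vector(d, rng):
    """Sample a direction uniformly from the unit sphere in R^d."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v_i * v_i for v_i in v))
    return [v_i / norm for v_i in v]


def newton_style_bandit_step(x, A_inv, loss_fn, eta, delta, rng):
    """One illustrative round: play a perturbed point, observe a scalar loss,
    and take a preconditioned step. Projection onto the feasible set is
    omitted here for brevity (an assumption of this sketch)."""
    d = len(x)
    u = random_unit_vector(d, rng)
    y = [x_i + delta * u_i for x_i, u_i in zip(x, u)]  # point actually played
    value = loss_fn(y)                                 # only scalar feedback
    g = [(d / delta) * value * u_i for u_i in u]       # one-point gradient estimate
    A_inv = sherman_morrison(A_inv, g)                 # O(d^2) inverse update
    step = mat_vec(A_inv, g)
    x_new = [x_i - eta * s_i for x_i, s_i in zip(x, step)]
    return x_new, A_inv
```

A caller would initialize `A_inv` to a scaled identity and invoke `newton_style_bandit_step` once per round; every operation inside is a matrix-vector product or an outer-product correction, so the per-round cost stays quadratic in the dimension.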

Paper Structure

This paper contains 82 sections, 34 theorems, 263 equations, 1 figure, 4 tables, 5 algorithms.

Key Result

theorem 1

For $d,T\in\mathbb{N}$, suppose that $\{f_t\}_{t=1}^T$ and the convex compact set $\mathcal{K}\subset\mathbb{R}^d$ satisfy the oblivious-adversary, curvature, and bounded-range-and-gradients assumptions. Let $B^* \coloneqq B+\sqrt{2}(L+\sqrt{2}C)$. Then BNS (Algorithm alg:simple-bqo) with … For the case of $\mathrm{diam}(\mathcal{K}) = \sqrt{d}$, by setting $\kappa'=\kappa$ and $\eta = (2\ldots$

Theorems & Definitions (66)

  • definition 1: $\kappa$-convexity
  • definition 2: DRC policy class
  • theorem 1: BNS regret
  • proof
  • theorem 2: BNS-AM regret
  • theorem 3: NBPC regret (Algorithm alg:bandit-control-known)
  • theorem 4: BCO-M lower bound
  • definition 3: Exp-concave functions \citep{hazan2022oco}
  • corollary 1: Regret bound for bandit logistic regression
  • corollary 2: Regret bound for bandit linear regression
  • ...and 56 more