Small steps no more: Global convergence of stochastic gradient bandits for arbitrary learning rates

Jincheng Mei, Bo Dai, Alekh Agarwal, Sharan Vaswani, Anant Raj, Csaba Szepesvari, Dale Schuurmans

TL;DR

The paper addresses the theoretical question of whether stochastic gradient bandits converge globally when using a constant learning rate $\eta>0$. It develops a probabilistic and martingale-based framework that reveals an intrinsic exploration property: the algorithm cannot lock onto a single action forever, and, for any $\eta>0$, converges almost surely to the globally optimal policy. Key contributions include showing that at least two actions are sampled infinitely often, extending the analysis from the two-action case to all $K\ge 2$, and deriving an $O\left( \frac{\log T}{T} \right)$ rate for averaged iterates. The results provide a robust theoretical foundation for stochastic gradient bandits, with practical implications for RL and large-scale optimization where decaying learning rates are undesirable or impractical.

Abstract

We provide a new understanding of the stochastic gradient bandit algorithm by showing that it converges to a globally optimal policy almost surely using any constant learning rate. This result demonstrates that the stochastic gradient algorithm continues to balance exploration and exploitation appropriately even in scenarios where standard smoothness and noise control assumptions break down. The proofs are based on novel findings about action sampling rates and the relationship between cumulative progress and noise, and extend the current understanding of how simple stochastic gradient methods behave in bandit settings.

Paper Structure

This paper contains 25 sections, 19 theorems, 206 equations, 3 figures, 1 algorithm.

Key Result

Proposition 1

Algorithm 1 (the gradient bandit algorithm with sampled rewards) is equivalent to the following update: $\theta_{t+1} \gets \theta_t + \eta \cdot \frac{d \ \pi_{\theta_t}^\top \hat{r}_t }{d \theta_t}$, where $\mathbb{E}_t{ [ \frac{d \ \pi_{\theta_t}^\top \hat{r}_t }{d \theta_t} ] } = \frac{d \ \pi_{\theta_t}^\top r}{d \theta_t }$, and $\mathbb{E}_t[\cdot]$ is defined with respect to randomness from on-policy sampling $a_t \sim \pi_{\theta_t}(\cdot)$ and reward sampling $R_t(a_t)\sim P_{a_t}$. The ...
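
The unbiasedness claim in Proposition 1 can be checked numerically. The sketch below is illustrative only and rests on assumptions not stated in this summary: a softmax policy $\pi_\theta$, and the importance-weighted estimator $\hat{r}_t(a) = \mathbb{1}\{a_t = a\} \, R_t(a) / \pi_{\theta_t}(a)$. All function names are hypothetical, and the numbers in `theta` and `r` are made up.

```python
import numpy as np

def softmax(theta):
    # Subtract the max before exponentiating for numerical stability.
    z = theta - np.max(theta)
    p = np.exp(z)
    return p / p.sum()

def softmax_jacobian(pi):
    # d pi / d theta for a softmax policy: diag(pi) - pi pi^T.
    return np.diag(pi) - np.outer(pi, pi)

def importance_weighted_reward(pi, action, reward):
    # Assumed estimator: zero everywhere except the sampled action.
    r_hat = np.zeros_like(pi)
    r_hat[action] = reward / pi[action]
    return r_hat

def gradient_bandit_step(theta, action, reward, eta):
    # One stochastic gradient ascent step on pi^T r using the estimate r_hat.
    pi = softmax(theta)
    r_hat = importance_weighted_reward(pi, action, reward)
    return theta + eta * softmax_jacobian(pi) @ r_hat

def expected_gradient_estimate(theta, mean_rewards):
    # Exact expectation over a_t ~ pi. The estimate is linear in the reward,
    # so the expectation over reward noise reduces to using the mean reward.
    pi = softmax(theta)
    jac = softmax_jacobian(pi)
    total = np.zeros_like(theta)
    for a, p in enumerate(pi):
        total += p * (jac @ importance_weighted_reward(pi, a, mean_rewards[a]))
    return total

theta = np.array([0.3, -1.2, 0.5, 2.0])   # made-up parameters
r = np.array([0.2, 0.9, -0.4, 0.1])       # made-up mean rewards
pi = softmax(theta)
exact_gradient = softmax_jacobian(pi) @ r
mean_estimate = expected_gradient_estimate(theta, r)
print(np.allclose(mean_estimate, exact_gradient))
```

Under these assumptions the step simplifies to $\theta_{t+1} = \theta_t + \eta \, R_t(a_t) \, (e_{a_t} - \pi_{\theta_t})$, where $e_{a_t}$ is the one-hot vector of the sampled action, so the division by $\pi_{\theta_t}(a_t)$ never has to be carried out explicitly.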

Figures (3)

  • Figure 1: Log sub-optimality gap, $\log{ (r(a^*) - \pi_{\theta_t}^\top r) }$, plotted against the logarithm of time, $\log{t}$, in a $4$-action problem with various learning rates, $\eta$. Each subplot shows a run with a specific learning rate. The curves in a subplot correspond to 10 different random seeds. Theory predicts that essentially all seeds will lead to a curve converging to zero ($-\infty$ in these plots). For a discussion of the results, see the text.
  • Figure 2: Visualization in a two-action stochastic bandit problem. Here the rewards are defined as $(-0.05, -0.25)$. Other details are the same as for Figure 1. The optimal-action-probability panels are based on a single run, while the sub-optimality-gap panel for $\eta = 10$ averages across 10 runs. Note that $\log{ ( r(a^*) - \pi_{\theta_t}^\top r ) } \approx 10^{-33}$ at the final stages in the sub-optimality-gap panel for $\eta = 100$.
  • Algorithm 1: Gradient bandit algorithm (without baselines)
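
An experiment in the spirit of Figure 1 can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' setup: the mean rewards, the uniform reward noise, the horizon, and the specific learning rates are all hypothetical, and `run_gradient_bandit` is an illustrative name. Only the 4-action setting, the constant learning rate, and the use of 10 seeds come from the captions above. The update is written in the closed form $\theta \gets \theta + \eta \, R_t(a_t) \, (e_{a_t} - \pi_\theta)$, which assumes a softmax policy with importance-weighted reward estimates.

```python
import numpy as np

def softmax(theta):
    # Subtract the max before exponentiating for numerical stability.
    z = theta - np.max(theta)
    p = np.exp(z)
    return p / p.sum()

def run_gradient_bandit(mean_rewards, eta, num_steps, seed, noise=0.1):
    """Run the gradient bandit update (no baseline) with a constant eta.

    Returns the sub-optimality gap r(a*) - pi^T r at every step.
    """
    rng = np.random.default_rng(seed)
    num_actions = len(mean_rewards)
    theta = np.zeros(num_actions)          # uniform initial policy
    best = np.max(mean_rewards)
    gaps = np.empty(num_steps)
    for t in range(num_steps):
        pi = softmax(theta)
        gaps[t] = best - pi @ mean_rewards
        a = rng.choice(num_actions, p=pi)  # on-policy action sampling
        # Hypothetical reward model: mean reward plus bounded uniform noise.
        reward = mean_rewards[a] + rng.uniform(-noise, noise)
        one_hot = np.zeros(num_actions)
        one_hot[a] = 1.0
        theta = theta + eta * reward * (one_hot - pi)
    return gaps

mean_rewards = np.array([0.2, 0.05, -0.1, -0.4])  # hypothetical 4-action problem
etas = [1.0, 10.0, 100.0]                         # hypothetical learning rates
results = {
    eta: np.stack([run_gradient_bandit(mean_rewards, eta, 2000, seed)
                   for seed in range(10)])        # 10 seeds per learning rate
    for eta in etas
}
for eta, gaps in results.items():
    print(eta, gaps[:, -1].mean())
```

Each row of `results[eta]` is one seed's gap trajectory; plotting the log of each row against $\log t$ would give one curve per seed. In floating point, a very large $\eta$ can drive a probability to exactly 0 or 1, after which the simulated iterate stops moving, so finite-precision runs may look stuck even though the analysis concerns exact arithmetic.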

Theorems & Definitions (36)

  • Proposition 1: Proposition 2.3 of mei2024stochastic
  • Remark 1
  • Lemma 1
  • Lemma 2: Avoiding a lack of exploration
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • proof
  • proof
  • proof
  • ...and 26 more