Convergence of a L2 regularized Policy Gradient Algorithm for the Multi Armed Bandit

Stefana Anita, Gabriel Turinici

TL;DR

This work investigates the convergence of the policy gradient algorithm for the Multi Armed Bandit when an $L2$ regularization term is present jointly with the 'softmax' parametrization, and proves convergence under appropriate technical hypotheses.

Abstract

Although the Multi Armed Bandit (MAB) on the one hand and the policy gradient approach on the other are among the most widely used frameworks of Reinforcement Learning, the theoretical properties of the policy gradient algorithm applied to MAB have not received enough attention. In this work we investigate the convergence of this procedure when an $L2$ regularization term is present jointly with the 'softmax' parametrization. We prove convergence under appropriate technical hypotheses and test the procedure numerically, including in situations beyond the theoretical setting. The tests show that a time-dependent regularized procedure can improve over the canonical approach, especially when the initial guess is far from the solution.
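
To make the setting concrete, here is a minimal sketch of a softmax policy gradient iteration with an $L2$ penalty on the preference vector $H$. The update rule used here, $H \leftarrow H + \rho\,(R\,(\mathbb{1}_{A} - \Pi_H) - \gamma H)$, is an assumption: it is the usual stochastic gradient step on the expected reward minus $\frac{\gamma}{2}\|H\|^2$, and may differ in detail from the algorithm analyzed in the paper. The Gaussian reward model and the function names are likewise illustrative choices, not taken from the source.

```python
import math
import random


def softmax(h):
    """Softmax distribution Pi_H over arms for a preference vector h."""
    m = max(h)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in h]
    total = sum(exps)
    return [e / total for e in exps]


def regularized_policy_gradient(means, h0, rho, gamma, steps, seed=0):
    """Run `steps` iterations of an L2-regularized policy gradient on a bandit.

    Assumptions (not from the source): arm rewards are Gaussian with the
    given means and unit variance; rho and gamma are held constant.
    """
    rng = random.Random(seed)
    h = list(h0)
    arms = range(len(h))
    for _ in range(steps):
        pi = softmax(h)
        arm = rng.choices(arms, weights=pi)[0]  # sample A_t from Pi_{H_t}
        reward = rng.gauss(means[arm], 1.0)     # observe R_t
        for a in arms:
            indicator = 1.0 if a == arm else 0.0
            # stochastic gradient of the expected reward, minus the L2 term
            h[a] += rho * (reward * (indicator - pi[a]) - gamma * h[a])
    return h, softmax(h)


if __name__ == "__main__":
    # Biased start, in the spirit of H_0 = (5, ..., 0) from the figures.
    h, pi = regularized_policy_gradient(
        means=[0.1, 0.5, 0.9], h0=[5.0, 0.0, 0.0],
        rho=0.05, gamma=0.01, steps=2000,
    )
    print("preferences:", h)
    print("policy:", pi)
```

Setting `gamma=0.0` recovers the unregularized gradient bandit update, which makes it easy to compare the two variants from the same starting point.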

Paper Structure (11 sections, 6 theorems, 56 equations, 3 figures)

Key Result

Lemma

Under hypotheses eq:hyp_mean and eq:hyp_second_moment: Moreover, for some constant $C_{q_*}$ depending only on $q_*$ and $C_m$:

Figures (3)

  • Figure 1: The average reward for $\rho_t=0.05$ (constant) and $\gamma$ equal to $0$, $0.01$ or $10$ (see the legend). Left: start from a uniform distribution $\Pi_{H_0}$ with $H_0=(0,...,0)$. Right: start from a biased distribution $\Pi_{H_0}$ with $H_0=(5,...,0)$.
  • Figure 2: The average reward when starting from the non-uniform distribution $\Pi_{H_0}$ with $H_0=(5,...,0)$ and $\rho_t=\frac{1}{1+0.05 t}$, in the general setting of proposition \ref{prop:cv_rate}, equation \ref{eq:rhot_beta_def}; we test $\gamma=0$, $\gamma=0.01$ and $\gamma=10$ (see the legend). As before, $\gamma=10$ is too large to obtain good results.
  • Figure 3: The average reward when starting from the biased distribution $\Pi_{H_0}$ with $H_0=(5,...,0)$ and $\gamma_t= \frac{\gamma_0}{1+0.2 t}$ (see the legend), with $\gamma_0=0$ (no regularization) or $\gamma_0=10$. We take $\rho_t=\frac{1}{1+0.05 t}$ (see eq. \ref{eq:rhot_beta_def}).
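
The decaying step size and regularization coefficient quoted in the captions of Figures 2 and 3 can be written as two small helpers. This is only a sketch: the function and parameter names are hypothetical, and the default constants are simply the values that appear in the captions.

```python
def rho_schedule(t, rho0=1.0, decay=0.05):
    """Step size rho_t = rho0 / (1 + decay * t); defaults follow Figures 2 and 3."""
    return rho0 / (1.0 + decay * t)


def gamma_schedule(t, gamma0=10.0, decay=0.2):
    """Regularization gamma_t = gamma0 / (1 + decay * t); defaults follow Figure 3."""
    return gamma0 / (1.0 + decay * t)


if __name__ == "__main__":
    for t in (0, 10, 100, 1000):
        print(t, rho_schedule(t), gamma_schedule(t))
```

With these defaults the regularization coefficient decays four times faster than the step size, so the penalty is strong at the start of the run and fades as the iterations proceed.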

Theorems & Definitions (13)

  • Lemma
  • Remark
  • Proof
  • Proposition
  • Proof
  • Lemma
  • Proof
  • Proposition
  • Proof
  • Lemma
  • ...and 3 more