Convergence of a L2 regularized Policy Gradient Algorithm for the Multi Armed Bandit

Stefana Anita; Gabriel Turinici

TL;DR

This work investigates the convergence of the policy gradient algorithm for the Multi Armed Bandit when an $L2$ regularization term is present jointly with the 'softmax' parametrization, and proves convergence under appropriate technical hypotheses.

Abstract

Although the Multi Armed Bandit (MAB) on the one hand and the policy gradient approach on the other hand are among the most used frameworks of Reinforcement Learning, the theoretical properties of the policy gradient algorithm used for MAB have not been given enough attention. In this work we investigate the convergence of this procedure when an $L2$ regularization term is present jointly with the 'softmax' parametrization. We prove convergence under appropriate technical hypotheses and test the procedure numerically, including in situations beyond the theoretical setting. The tests show that a time-dependent regularized procedure can improve over the canonical approach, especially when the initial guess is far from the solution.
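To make the procedure described in the abstract concrete, here is a minimal sketch of a softmax policy gradient bandit with an $L2$ penalty on the preference vector $H$. It is an illustration under stated assumptions, not the authors' implementation: the regularized objective is assumed to be $E[R] - \frac{\gamma}{2}\|H\|^2$, rewards are assumed Gaussian with unit variance, no baseline is subtracted, and the names `softmax` and `run_bandit` are hypothetical.

```python
import numpy as np


def softmax(h):
    """Numerically stable softmax of a preference vector."""
    z = h - np.max(h)
    e = np.exp(z)
    return e / e.sum()


def run_bandit(means, h0, rho, gamma, steps, seed=0):
    """Stochastic gradient ascent on E[R] - gamma/2 * |H|^2 (assumed form).

    means : true mean reward of each arm (assumption: Gaussian rewards, std 1)
    h0    : initial preference vector H_0
    rho   : constant step size
    gamma : L2 regularization coefficient
    Returns the final policy softmax(H).
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    h = np.array(h0, dtype=float)
    for _ in range(steps):
        pi = softmax(h)
        arm = rng.choice(len(h), p=pi)
        reward = rng.normal(means[arm], 1.0)
        # Unbiased estimate of the gradient of E[R]: reward * (e_arm - pi).
        grad = -reward * pi
        grad[arm] += reward
        # The L2 term contributes -gamma * H to the gradient.
        h += rho * (grad - gamma * h)
    return softmax(h)
```

For instance, `run_bandit([0.2, 0.8, 0.5], [5.0, 0.0, 0.0], rho=0.05, gamma=0.01, steps=1000)` starts from a policy biased toward the first arm, which is the kind of initial guess the abstract refers to as far from the solution.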

Paper Structure (11 sections, 6 theorems, 56 equations, 3 figures)

Introduction
Brief literature review
The softmax parameterized policy gradient Multi Armed Bandit with $L2$ regularization
Theoretical convergence results
Fixed time step
Convergence rates for linear decay $\rho_t= \frac{\beta_1}{1+\beta_2 t }$ and large $\gamma$
Behavior when $\gamma\to 0$
Numerical simulations
Summary and discussion
Appendix
Further comments on the assumption $\mu>0$

Key Result

Lemma

Under hypotheses \ref{eq:hyp_mean} and \ref{eq:hyp_second_moment}: [...] Moreover, for some constant $C_{q_*}$ depending only on $q_*$ and $C_m$: [...]

Figures (3)

Figure 1: The average reward for $\rho_t=0.05$ (constant); $\gamma$ is $0$, $0.01$ or $10$ (see the legend). Left: start from a uniform distribution $\Pi_{H_0}$ with $H_0=(0,...,0)$. Right: start from a biased distribution $\Pi_{H_0}$ with $H_0=(5,...,0)$.
Figure 2: The average reward when starting from the non-uniform distribution $\Pi_{H_0}$ with $H_0=(5,...,0)$ and $\rho_t=\frac{1}{1+0.05 t}$ in the general setting of proposition \ref{prop:cv_rate}, equation \ref{eq:rhot_beta_def}; we test $\gamma=0$, $\gamma=0.01$ or $\gamma=10$ (see the legend). As before, $\gamma=10$ is too large to obtain good results.
Figure 3: The average reward when starting from the biased distribution $\Pi_{H_0}$ with $H_0=(5,...,0)$ and $\gamma_t= \frac{\gamma_0}{1+0.2 t}$ (see the legend), with $\gamma_0=0$ (no regularization) or $\gamma_0=10$. We take $\rho_t=\frac{1}{1+0.05 t}$ (see eq. \ref{eq:rhot_beta_def}).
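The decaying schedules quoted in the captions of Figures 2 and 3 can be written out as small helper functions. This is only a sketch: the function and parameter names (`rho_t`, `gamma_t`, `beta1`, `beta2`, `gamma0`, `decay`) are hypothetical, and the default values are simply those given in the captions.

```python
def rho_t(t, beta1=1.0, beta2=0.05):
    """Step size rho_t = beta1 / (1 + beta2 * t), as in Figures 2 and 3."""
    return beta1 / (1.0 + beta2 * t)


def gamma_t(t, gamma0=10.0, decay=0.2):
    """Time-dependent regularization gamma_t = gamma0 / (1 + decay * t), as in Figure 3."""
    return gamma0 / (1.0 + decay * t)


if __name__ == "__main__":
    # Print the first few values of both schedules side by side.
    for t in (0, 5, 20, 100):
        print(t, rho_t(t), gamma_t(t))
```

With these defaults the regularization coefficient decays four times faster than the step size, so the penalty is strong early on and fades as the iterations proceed.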

Theorems & Definitions (13)

Lemma
Remark
Proof
Proposition
Proof
Lemma
Proof
Proposition
Proof
Lemma
...and 3 more
