
Generalized Munchausen Reinforcement Learning using Tsallis KL Divergence

Lingwei Zhu, Zheng Chen, Matthew Schlegel, Martha White

TL;DR

A generalized KL divergence is investigated -- called the Tsallis KL divergence -- which uses the $q$-logarithm in its definition, and it is shown that the resulting generalized MVI($q$) obtains significant improvements over the standard MVI ($q = 1$) across 35 Atari games.

Abstract

Many policy optimization approaches in reinforcement learning incorporate a Kullback-Leibler (KL) divergence to the previous policy, to prevent the policy from changing too quickly. This idea was initially proposed in a seminal paper on Conservative Policy Iteration, with approximations given by algorithms like TRPO and Munchausen Value Iteration (MVI). We continue this line of work by investigating a generalized KL divergence -- called the Tsallis KL divergence -- which uses the $q$-logarithm in the definition. The approach is a strict generalization, as $q = 1$ corresponds to the standard KL divergence; $q > 1$ provides a range of new options. We characterize the types of policies learned under the Tsallis KL, and motivate when $q > 1$ could be beneficial. To obtain a practical algorithm that incorporates Tsallis KL regularization, we extend MVI, which is one of the simplest approaches to incorporate KL regularization. We show that this generalized MVI($q$) obtains significant improvements over the standard MVI($q = 1$) across 35 Atari games.
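The claim that $q = 1$ recovers the standard KL divergence can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the convention $\ln_q x = (x^{1-q} - 1)/(1-q)$, which is consistent with the Tsallis entropy component $-\pi^q\ln_q\pi$ quoted in the Figure 1 caption, and it takes the Tsallis KL between discrete distributions to be $\sum_a \pi(a)\,(-\ln_q\frac{\mu(a)}{\pi(a)})$. The function names and the two example distributions are hypothetical.

```python
import numpy as np

def ln_q(x, q):
    """q-logarithm under the ASSUMED convention (x**(1-q) - 1) / (1-q).

    Falls back to the natural logarithm when q is (numerically) 1.
    """
    x = np.asarray(x, dtype=float)
    if np.isclose(q, 1.0):
        return np.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis_kl(pi, mu, q):
    """Tsallis KL between discrete distributions: sum_a pi(a) * (-ln_q(mu(a) / pi(a))).

    Actions with pi(a) == 0 are skipped; mu is assumed positive wherever pi is.
    """
    pi = np.asarray(pi, dtype=float)
    mu = np.asarray(mu, dtype=float)
    support = pi > 0
    return float(np.sum(-pi[support] * ln_q(mu[support] / pi[support], q)))

# Hypothetical policies over three actions.
pi = np.array([0.7, 0.2, 0.1])
mu = np.array([0.4, 0.4, 0.2])

standard_kl = float(np.sum(pi * np.log(pi / mu)))
print("standard KL:", standard_kl)
for q in (1.0, 1.001, 2.0, 3.0):
    print("q =", q, "Tsallis KL:", tsallis_kl(pi, mu, q))
```

Under this convention the value at $q = 2$ equals $\sum_a \pi(a)^2/\mu(a) - 1$, and values of $q$ close to 1 approach the standard KL.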

Paper Structure (20 sections, 6 theorems, 39 equations, 9 figures, 2 tables, 1 algorithm)

Key Result

Theorem 1

Let $\Omega(\pi) = -S_q(\pi)$ in Eq. (eq:regularizedPI). Then the regularized optimal policies can be expressed: where $\psi_q = \tilde{\psi}_q + \frac{1}{1-q}$. Additionally, for an arbitrary $(q, \tau)$ pair with $q > 1$, the same truncation effect (support) can be achieved using $(q=2, \frac{\tau}{q-1})$.
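The truncation effect mentioned in the theorem can be illustrated with the standard sparsemax operator (the Euclidean projection onto the probability simplex), which the Figure 2 caption associates with $q = 2$. This is a hedged sketch rather than the paper's operator: the action values are hypothetical, and the code only shows how the temperature $\tau$ changes the size of the policy's support when sparsemax is applied to $Q/\tau$.

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of a score vector z onto the probability simplex."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]            # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    in_support = 1.0 + k * z_sorted > cumsum
    k_max = k[in_support][-1]              # number of actions kept in the support
    threshold = (cumsum[k_max - 1] - 1.0) / k_max
    return np.maximum(z - threshold, 0.0)  # scores below the threshold are truncated to 0

# Hypothetical action values for a single state.
q_values = np.array([1.0, 0.8, 0.5, 0.1])

for tau in (10.0, 1.0, 0.1):
    policy = sparsemax(q_values / tau)
    print("tau =", tau, "policy =", policy, "support size =", int(np.count_nonzero(policy)))
```

With these values, a smaller $\tau$ assigns zero probability to more actions, which is the kind of truncation the theorem refers to.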

Figures (9)

  • Figure 1: $\ln_q x$, $\exp_q x$ and Tsallis entropy component $-\pi^q\ln_q\pi$ for $q=1$ to $5$. When $q=1$ they respectively recover their standard counterparts. $\pi$ is chosen to be Gaussian $\mathcal{N}(2,1)$. As $q$ gets larger, $\ln_q x$ (and hence Tsallis entropy) becomes flatter and $\exp_q x$ steeper.
  • Figure 2: (Left) Tsallis KL component $-\pi_1\ln_q\frac{\pi_2}{\pi_1}$ between two Gaussian policies $\pi_1 = \mathcal{N}(2.75, 1), \pi_2=\mathcal{N}(3.25, 1)$ for $q=1$ to $5$. When $q=1$ TKL recovers KL. For $q>1$, TKL is more mode-covering than KL. (Mid) The sparsemax operator acting on a Boltzmann policy when $q=2$. (Right) The sparsemax when $q=50$. Truncation gets stronger as $q$ gets larger. The same effect can also be controlled by $\tau$.
  • Figure 3: MVI$(q)$ on CartPole-v1 for $q=2, 3, 4, 5$, averaged over 50 seeds, with $\tau = 0.03, \alpha =0.9$. (Left) The difference between the proposed action gap $Q_k - \mathcal{M}_{q,\tau}{Q_k}$ and the general Munchausen term $\ln_q \pi_{k+1}$ converges to a constant. (Right) The residual $R_q(\pi_{k+1}, \pi_k)$ becomes larger as $q$ increases. For $q=2$, it remains negligible throughout learning.
  • Figure 4: Learning curves of MVI($q$) and M-VI on the selected Atari games, averaged over 3 independent runs, with the ribbon denoting the standard error. On some environments MVI($q$) significantly improves upon M-VI. Quantitative improvements over M-VI and Tsallis-VI are shown in Figure 5.
  • Figure 5: (Left) The percent improvement of MVI($q$) with $q = 2$ over standard MVI (where $q=1$) on select Atari games. The improvement is computed by subtracting the MVI score from the MVI($q$) score and normalizing by the MVI score. (Right) Improvement over Tsallis-VI on Atari environments, normalized by the Tsallis-VI scores.
  • ...and 4 more figures
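The percent improvement described in the Figure 5 caption amounts to a one-line calculation. The sketch below assumes the plain reading of the caption (difference of scores divided by the baseline score, times 100); the function name and the example scores are hypothetical and are not results from the paper.

```python
def percent_improvement(score, baseline):
    """Percent improvement of `score` over `baseline`, normalized by the baseline."""
    return 100.0 * (score - baseline) / baseline

# Hypothetical per-game scores, for illustration only.
mvi_q_score = 150.0
mvi_score = 100.0
print(percent_improvement(mvi_q_score, mvi_score))
```

Note that this plain normalization flips sign when the baseline score is negative and is undefined when it is zero; how such cases are handled in the figure is not stated in the caption.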

Theorems & Definitions (10)

  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof
  • Theorem 4
  • proof
  • Lemma 1: geist19-regularized
  • Lemma 2: Eq. (25) of Yamano2004-properties-qlogexp