How Does Return Distribution in Distributional Reinforcement Learning Help Optimization?

Ke Sun, Bei Jiang, Linglong Kong

TL;DR

The distribution loss of distributional RL has desirable smoothness properties and therefore yields stable gradients, which is consistent with its tendency to promote optimization stability. This helps explain how the return distribution in distributional RL algorithms aids optimization.

Abstract

Distributional reinforcement learning, which learns the entire return distribution rather than only its expectation as in standard RL, has demonstrated remarkable success in enhancing performance. Despite these advances, our understanding of how the return distribution helps within distributional RL remains limited. In this study, we investigate the optimization advantages that distributional RL gains over classical RL from its extra return distribution knowledge, within the Neural Fitted Z-Iteration (Neural FZI) framework. First, we demonstrate that the distribution loss of distributional RL has desirable smoothness properties and hence enjoys stable gradients, which is consistent with its tendency to promote optimization stability. Second, we reveal the acceleration effect of distributional RL by decomposing the return distribution. This decomposition shows that distributional RL can perform favorably if the return distribution approximation is appropriate, as measured by the variance of gradient estimates in each environment. Experiments validate the stable optimization behavior of distributional RL and its acceleration effects compared to classical RL. Our findings illuminate how the return distribution in distributional RL algorithms helps optimization.

Paper Structure (23 sections, 8 theorems, 38 equations, 5 figures, 1 table)


Key Result

Proposition 1

(Properties of Categorical Distributional Loss) Assume the state features satisfy $\Vert \mathbf{x}(s) \Vert_2 \leq l$ for each state $s$. Then $\mathcal{L}_\theta$ is $kl$-Lipschitz continuous, $kl^2$-smooth, and convex w.r.t. the parameter $\theta$.
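As a rough numerical illustration of what such a statement means, the sketch below checks convexity and a gradient-norm bound on a toy categorical loss. Everything in it is an assumption made for illustration and is not taken from this page: the loss is taken to be the cross-entropy between a fixed target histogram and a softmax over linear logits $\theta_i^\top \mathbf{x}(s)$, and the names `categorical_loss`, `categorical_grad`, and `check_properties` are hypothetical. For this particular toy loss the gradient with respect to $\theta$ is $(p - q)\,\mathbf{x}(s)^\top$, so its Frobenius norm is at most $\sqrt{2}\,l$; this is the constant the sketch tests, not the $kl$ constant of the proposition.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_norm for z in logits]


def logits_of(theta, x):
    """Linear logits: one row of theta per histogram bin (assumed parameterization)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in theta]


def categorical_loss(theta, x, target):
    """Cross-entropy between the target histogram and softmax(theta @ x)."""
    log_probs = log_softmax(logits_of(theta, x))
    return -sum(q * lp for q, lp in zip(target, log_probs))


def categorical_grad(theta, x, target):
    """Gradient of the loss w.r.t. theta: (p - q) outer x."""
    probs = [math.exp(lp) for lp in log_softmax(logits_of(theta, x))]
    return [[(p - q) * xi for xi in x] for p, q in zip(probs, target)]


def frobenius_norm(mat):
    return math.sqrt(sum(v * v for row in mat for v in row))


def check_properties(num_bins=5, dim=3, l=2.0, trials=200, seed=0):
    """Return (largest gradient norm seen, whether midpoint convexity held)."""
    rng = random.Random(seed)
    max_grad, convex_ok = 0.0, True
    for _ in range(trials):
        # Random feature vector rescaled so that ||x||_2 <= l.
        raw = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        scale = l * rng.random() / math.sqrt(sum(v * v for v in raw))
        x = [v * scale for v in raw]
        # Random target histogram (non-negative, sums to one).
        weights = [rng.random() + 1e-3 for _ in range(num_bins)]
        target = [w / sum(weights) for w in weights]
        # Two random parameter matrices and their midpoint.
        a = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bins)]
        b = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bins)]
        mid = [[0.5 * (u + v) for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]
        # Midpoint convexity: L(mid) <= (L(a) + L(b)) / 2, up to rounding.
        avg = 0.5 * (categorical_loss(a, x, target) + categorical_loss(b, x, target))
        if categorical_loss(mid, x, target) > avg + 1e-9:
            convex_ok = False
        max_grad = max(max_grad, frobenius_norm(categorical_grad(a, x, target)))
    return max_grad, convex_ok
```

Calling `check_properties()` returns the largest gradient norm observed and a flag for the convexity check. A bounded gradient norm is what Lipschitz continuity in $\theta$ amounts to for a differentiable loss, which is why the sketch tracks it.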

Figures (5)

  • Figure 1: Performance. Learning curves of AC, DAC (C51), and DAC (IQN) over five seeds, with a smoothing window of size five, across eight MuJoCo environments.
  • Figure 2: Uniform Stability. The critic gradient norms with respect to the state, in logarithmic scale, during the training of AC, DAC (C51), and DAC (IQN) over five seeds on eight MuJoCo environments.
  • Figure 3: Acceleration Effect. The critic gradient norms with respect to the network parameters, in logarithmic scale, during the training of AC, DAC (C51), and DAC (IQN) over five seeds on MuJoCo environments.
  • Figure 4: The critic gradient norms, in logarithmic scale, during the training of AC and DAC (C51) over five seeds on three MuJoCo environments. We keep the same DAC network architecture and evaluate based on the expectation of the represented value distribution.
  • Figure 5: The critic gradient norms, in logarithmic scale, during the training of AC and DAC (C51) over five seeds on three MuJoCo environments. The AC results are the expectation part calculated via the Return Density Function Decomposition.

Theorems & Definitions (19)

  • Proposition 1
  • Definition 1
  • Theorem 1
  • Proposition 2
  • Definition 2
  • Theorem 2
  • proof
  • proof
  • Definition 3
  • Definition 4
  • ...and 9 more