PG-Rainbow: Using Distributional Reinforcement Learning in Policy Gradient Methods

WooJae Jeon; KangJun Lee; Jeewoo Lee

TL;DR

PG-Rainbow addresses the sample inefficiency and distribution-insensitive nature of on-policy policy gradient methods by integrating distributional reinforcement learning into PPO. It connects an implicit quantile network (IQN) with a distillation network to transfer quantile information into PPO’s value head, producing a distribution-aware critic: $V_{\psi}(s)=f_{\psi}(V_{\theta}(s) \odot q_{\phi}(s))$. Empirical results on Atari via ALE show that PG-Rainbow generally outperforms PPO, with ablations demonstrating that timely distillation, number of quantiles, and the chosen input form for the distillation network all critically affect performance. This hybrid approach leverages off-policy distributional information to boost on-policy learning, suggesting a practical path toward more data-efficient and robust reinforcement learning in discrete-action domains.
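
To make the critic formula above concrete, here is a minimal NumPy sketch of a distillation head. It is an illustration under stated assumptions, not the authors' implementation: the tensor shapes, the two-layer ReLU network standing in for $f_{\psi}$, the hidden width, and the names `init_distill_params` and `distilled_value` are all hypothetical. It only shows how a scalar value estimate can be combined element-wise with a vector of quantile values and mapped back to a scalar.

```python
import numpy as np


def init_distill_params(num_quantiles, hidden=64, seed=0):
    """Create weights for a small two-layer MLP standing in for f_psi.

    The layer sizes and initialisation scale are assumptions for illustration.
    """
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.normal(0.0, 0.1, size=(num_quantiles, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.1, size=(hidden, 1)),
        "b2": np.zeros(1),
    }


def distilled_value(v_theta, q_phi, params):
    """Compute V_psi(s) = f_psi(V_theta(s) ⊙ q_phi(s)) for a batch of states.

    v_theta: shape (batch, 1), scalar value estimates from the PPO critic.
    q_phi:   shape (batch, num_quantiles), quantile values from the IQN.
    Returns an array of shape (batch,).
    """
    # Element-wise product; NumPy broadcasts the (batch, 1) value estimate
    # across the quantile dimension (assumed reading of the ⊙ operator).
    mixed = v_theta * q_phi
    # Hypothetical f_psi: one ReLU hidden layer followed by a linear output.
    hidden = np.maximum(mixed @ params["w1"] + params["b1"], 0.0)
    return (hidden @ params["w2"] + params["b2"]).squeeze(-1)


# Toy inputs: two states, eight quantile values each (made-up numbers).
params = init_distill_params(num_quantiles=8)
v_theta = np.array([[1.5], [-0.3]])
q_phi = np.linspace(-1.0, 1.0, 16).reshape(2, 8)
v_psi = distilled_value(v_theta, q_phi, params)
```

In a full training setup the parameters of this head would be learned alongside the PPO critic; the sketch covers only the forward pass.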

Abstract

This paper introduces PG-Rainbow, a novel algorithm that incorporates a distributional reinforcement learning framework into a policy gradient algorithm. Existing policy gradient methods are sample inefficient and rely on the mean of returns when calculating the state-action value function, neglecting the distributional nature of returns in reinforcement learning tasks. To address this issue, we use an Implicit Quantile Network that provides quantile information about the distribution of rewards to the critic network of the Proximal Policy Optimization algorithm. Our empirical results show that integrating reward distribution information into the policy network lets the agent evaluate the consequences of potential actions in a given state more comprehensively, supporting better-informed decisions. We evaluate the proposed algorithm on the Atari-2600 game suite, simulated via the Arcade Learning Environment (ALE).

Paper Structure (16 sections, 10 equations, 12 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Policy Gradient Methods
Distributional Reinforcement Learning
PG-Rainbow
Analyzing Multi-Modality of Rewards
Utilizing a distillation network
Algorithm
Experiments and Results
Implementation Details
Results
Conclusion
Complete list of hyperparameters
Complete Atari Training Results (PG-Rainbow)
Comparison of results with Lagging PG-Rainbow
...and 1 more section

Figures (12)

Figure 1: Comparison of model architecture between PPO and PG-Rainbow. PG-Rainbow uses a distillation network to incorporate quantile value distribution data into the policy network.
Figure 2: Histograms depicting the distribution of returns for the value function and action NO-OP in the DemonAttack environment of the Atari suite. The blue histogram represents the returns captured by the value function, while the orange histogram depicts the returns captured by the Q function. The disparity between the distributions highlights the limitations of the value function in accurately capturing the range of returns for each action value.
Figure 3: Comparison of performance between using a distillation network and replacing the value function. The use of a distillation network significantly outperforms a model where the PPO value function is replaced with the IQN value function. The model using the replacement method fails to improve its performance at all.
Figure 4: Performance of PG-Rainbow and PPO on Atari environments. Plots show the average episodic returns of the agent trained for 1 million timesteps. PG-Rainbow outperforms PPO in most environments.
Figure 5: Performance of PG-Rainbow and Delayed PG-Rainbow on DemonAttack and BeamRider. Plots show the average episodic returns of the agent trained for 1 million timesteps. Results show that deferring the distillation process to the PPO agent adversely affects its performance.
...and 7 more figures
