Upside Down Reinforcement Learning with Policy Generators

Jacopo Di Ventura, Dylan R. Ashley, Vincent Herrmann, Francesco Faccio, Jürgen Schmidhuber

TL;DR

The paper addresses learning command-conditioned RL policies without a separate evaluator by proposing UDRLPG, which uses a hypernetwork to map return commands $c$ to policy parameters $\theta$ via $\theta = G_\rho(c)$. Training employs hindsight learning and a bucketed replay buffer, enabling the generator to cover the return spectrum and generalize to unseen returns. Empirical results show UDRLPG is competitive with GoGePo and DDPG across multiple Gym tasks, with zero-shot generalization and improved exploration, although some environments exhibit slower convergence and higher variance in final returns. Overall, UDRLPG demonstrates a simpler, potentially more sample-efficient route to policy generation in RL and highlights how initialization bias in hypernetworks can influence multimodality and learning dynamics.
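The mapping $\theta = G_\rho(c)$ can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper's implementation: the names `PolicyGenerator`, `policy_act`, and `hindsight_loss` are hypothetical, the generated policy is a single linear layer with a `tanh` squashing, and the hindsight objective is written as a mean squared error between the generated parameters and the parameters of a stored policy, conditioned on the return that policy actually achieved.

```python
import math
import random


def init_linear(n_in, n_out, rng):
    """Return (weights, biases) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(layer, x):
    """Apply a dense layer to the input vector x."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


class PolicyGenerator:
    """Toy generator G_rho: scalar return command c -> flat policy parameters theta."""

    def __init__(self, obs_dim, act_dim, hidden=16, seed=0):
        rng = random.Random(seed)
        self.obs_dim, self.act_dim = obs_dim, act_dim
        # Assumed policy shape: one linear layer (weight matrix plus bias vector).
        self.n_params = act_dim * obs_dim + act_dim
        self.l1 = init_linear(1, hidden, rng)
        self.l2 = init_linear(hidden, self.n_params, rng)

    def generate(self, command):
        """Decode a desired return into a flat parameter vector."""
        h = [math.tanh(v) for v in linear(self.l1, [command])]
        return linear(self.l2, h)


def policy_act(theta, obs, act_dim):
    """Run the generated linear policy on one observation."""
    obs_dim = len(obs)
    rows = [theta[i * obs_dim:(i + 1) * obs_dim] for i in range(act_dim)]
    biases = theta[act_dim * obs_dim:]
    return [math.tanh(sum(w * o for w, o in zip(row, obs)) + b)
            for row, b in zip(rows, biases)]


def hindsight_loss(generator, theta, achieved_return):
    """Assumed objective: MSE between G(achieved_return) and the stored theta."""
    predicted = generator.generate(achieved_return)
    return sum((p - t) ** 2 for p, t in zip(predicted, theta)) / len(theta)
```

In use, one would call `generator.generate(c)` with a desired return `c`, roll out the resulting policy with `policy_act`, store the parameters together with the return actually obtained, and later minimize `hindsight_loss` on stored pairs. No critic or evaluator appears anywhere in this loop, which is the simplification the summary describes.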

Abstract

Upside Down Reinforcement Learning (UDRL) is a promising framework for solving reinforcement learning problems which focuses on learning command-conditioned policies. In this work, we extend UDRL to the task of learning a command-conditioned generator of deep neural network policies. We accomplish this using Hypernetworks, a variant of Fast Weight Programmers, which learn to decode input commands representing a desired expected return into command-specific weight matrices. Our method, dubbed Upside Down Reinforcement Learning with Policy Generators (UDRLPG), streamlines comparable techniques by removing the need for an evaluator or critic to update the weights of the generator. To counteract the increased variance in final returns caused by not having an evaluator, we decouple the sampling probability of the buffer from the absolute number of policies in it, which, together with a simple weighting strategy, improves the empirical convergence of the algorithm. Compared with existing algorithms, UDRLPG achieves competitive performance and high returns, sometimes outperforming more complex architectures. Our experiments show that a trained generator can generalize to create policies that achieve unseen returns zero-shot. The proposed method appears to be effective in mitigating some of the challenges associated with learning highly multimodal functions. Altogether, we believe that UDRLPG represents a promising step forward in achieving greater empirical sample efficiency in RL. A full implementation of UDRLPG is publicly available at https://github.com/JacopoD/udrlpg_
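The idea of decoupling the sampling probability from the number of stored policies can be sketched as a two-stage draw: first pick a return bucket, then pick a policy uniformly inside it. The sketch below is only an illustration under stated assumptions. The class name `BucketedBuffer`, the fixed `bucket_width`, and the rank-based weights that favor higher-return buckets are all hypothetical choices; the abstract says only that a "simple weighting strategy" is used and does not specify it.

```python
import math
import random
from collections import defaultdict


class BucketedBuffer:
    """Toy replay buffer that groups (theta, return) pairs into return buckets."""

    def __init__(self, bucket_width=10.0, seed=0):
        self.bucket_width = bucket_width  # assumed: fixed-width return intervals
        self.buckets = defaultdict(list)
        self.rng = random.Random(seed)

    def add(self, theta, ret):
        """Store a policy under the bucket that contains its achieved return."""
        key = math.floor(ret / self.bucket_width)
        self.buckets[key].append((theta, ret))

    def sample(self):
        """Draw a bucket by weight, then a policy uniformly within that bucket."""
        if not self.buckets:
            raise ValueError("cannot sample from an empty buffer")
        keys = sorted(self.buckets)
        # Assumed weighting: proportional to the bucket's rank by return, so
        # higher-return buckets are drawn more often. A bucket's size plays
        # no role in how likely it is to be chosen.
        weights = [rank + 1 for rank in range(len(keys))]
        key = self.rng.choices(keys, weights=weights, k=1)[0]
        return self.rng.choice(self.buckets[key])
```

With this layout, a single high-return policy is not drowned out by hundreds of low-return ones: if 100 policies sit in the lowest bucket and one sits in a higher bucket, the higher bucket is still drawn with probability 2/3 under the assumed rank weights.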

Paper Structure

This paper contains 6 sections, 4 figures.

Figures (4)

  • Figure 1: Performance of policies from UDRLPG, GoGePo, and DDPG during training in all environments. Lines show mean return and $95\%$ bootstrapped confidence intervals from $20$ independent runs.
  • Figure 2: Variance of final returns. Lines show mean and $95\%$ confidence intervals from $20$ evaluation runs.
  • Figure 3: Performance comparison of UDRLPG policies across all test environments using four buffer strategies. The proposed strategy, buckets with weighted sampling, is shown in blue. Curves show mean returns with $95\%$ bootstrapped confidence intervals from $20$ runs.
  • Figure 4: Mean returns over $10$ episodes of policies from the generator as a function of the given command. Results are averaged over $5$ independent runs.