Policy Evaluation Networks

Jean Harb, Tom Schaul, Doina Precup, Pierre-Luc Bacon

TL;DR

Policy Evaluation Networks generalize value predictions across policies rather than across states. The method learns a differentiable Policy Evaluation Network (PVN) that predicts a policy's expected return, and uses policy fingerprints to embed each policy compactly. Because the PVN is differentiable, deterministic gradient ascent can be run directly in policy space without collecting new data. Experiments on a value polytope, CartPole, and Swimmer show that gradient ascent through the PVN can produce policies that outperform those seen during training, including results on Swimmer that beat common baselines. The approach offers a data-efficient alternative to environment-based policy optimization by using a learned surrogate for policy evaluation.

Abstract

Many reinforcement learning algorithms use value functions to guide the search for better policies. These methods estimate the value of a single policy while generalizing across many states. The core idea of this paper is to flip this convention and estimate the value of many policies, for a single set of states. This approach opens up the possibility of performing direct gradient ascent in policy space without seeing any new data. The main challenge for this approach is finding a way to represent complex policies that facilitates learning and generalization. To address this problem, we introduce a scalable, differentiable fingerprinting mechanism that retains essential policy information in a concise embedding. Our empirical results demonstrate that combining these three elements (learned Policy Evaluation Network, policy fingerprints, gradient ascent) can produce policies that outperform those that generated the training data, in a zero-shot manner.
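
The three elements named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the policy is a two-parameter tanh-linear map, the probing states are fixed random vectors (the paper trains them jointly with the PVN), the PVN is replaced by a ridge-regression read-out on the fingerprint, and true_return is a hypothetical stand-in for environment rollouts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: every value below is an illustrative assumption.
DIM = 2                                  # number of policy parameters
PROBES = rng.normal(size=(32, DIM))      # fixed probing states (the paper learns these)
TARGET = np.array([0.5, -0.3])           # hypothetical optimum of the toy return


def true_return(theta):
    """Stand-in for environment rollouts: higher is better, peak at TARGET."""
    return -np.sum((theta - TARGET) ** 2, axis=-1)


def fingerprint(theta):
    """Embed a tanh-linear policy by its outputs on the probing states."""
    return np.tanh(theta @ PROBES.T)


def fit_pvn(thetas, returns, ridge=1e-2):
    """Fit a linear read-out (plus bias) on fingerprints by ridge regression."""
    feats = np.hstack([fingerprint(thetas), np.ones((len(thetas), 1))])
    gram = feats.T @ feats + ridge * np.eye(feats.shape[1])
    return np.linalg.solve(gram, feats.T @ returns)


def pvn_value(weights, theta):
    """Predicted return of the policy with parameters theta."""
    return fingerprint(theta) @ weights[:-1] + weights[-1]


def pvn_grad(weights, theta):
    """Analytic gradient of the predicted return with respect to theta."""
    slope = 1.0 - fingerprint(theta) ** 2        # derivative of tanh
    return (weights[:-1] * slope) @ PROBES


def ascend(weights, theta, steps=100, lr=0.05):
    """Plain gradient ascent on the surrogate; true_return is never called."""
    for _ in range(steps):
        theta = theta + lr * pvn_grad(weights, theta)
    return theta


# Dataset of (policy, return) pairs, then ascent starting from one training policy.
train_thetas = rng.uniform(-1.0, 1.0, size=(500, DIM))
weights = fit_pvn(train_thetas, true_return(train_thetas))
start = train_thetas[0]
improved = ascend(weights, start)
print("true return before:", true_return(start), "after:", true_return(improved))
```

The ascent loop only queries the fitted surrogate, which is the sense in which no new data is needed. Whether the true return actually improves depends on how well the surrogate generalizes around the starting policy, so the printed values should be read as a diagnostic of this toy setup rather than a guarantee.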

Paper Structure

This paper contains 16 sections, 12 equations, 6 figures, 1 table, 3 algorithms.

Figures (6)

  • Figure 1: Diagram of the complete Policy Evaluation Network setup, including Network Fingerprinting (in gray). The blue color of the probing states and PVN indicates that they can be seen as one set of weights, trained in unison.
  • Figure 2: Visualization of a value polytope and a sampled dataset of policies. Training curves show a PVN can learn to generalize and predict the points in the test set.
  • Figure 3: Comparison of gradient fields of the exact and approximated value functions. The two axes in the exact and learned gradient-field panels are the policy spaces in each of the two states, and the arrows represent the gradients $\frac{\partial J(\theta)}{\partial\theta}$ and $\frac{\partial{\bm \psi}({\bm \theta})}{\partial\theta}$, respectively. The blue and red dots are steps of the gradient ascent process, mapped onto the polytope in the ascent-path panel. Both ascents were run for 100 steps.
  • Figure 4: Plots showing histograms of training policies' expected returns and the performance of gradient ascent through a learned PVN. The effects of Network Fingerprinting are drastic when using MLP policies.
  • Figure 5: Gradient ascent performed on Swimmer. We compare the improvement of 5 starting policies and plot the average improvement in bold. Horizontal dashed lines are baselines. Their scores were taken from https://spinningup.openai.com/en/latest/spinningup/bench.html
  • ...and 1 more figure