From Parameters to Behaviors: Unsupervised Compression of the Policy Space

Davide Tenedini, Riccardo Zamboni, Mirco Mutti, Marcello Restelli

TL;DR

We present a novel, unsupervised approach that compresses the policy parameter space $\Theta$ into a low-dimensional latent space $\mathcal{Z}$, and show that the learned manifold enables task-specific adaptation via Policy Gradient operating in $\mathcal{Z}$.

Abstract

Despite its recent successes, Deep Reinforcement Learning (DRL) is notoriously sample-inefficient. We argue that this inefficiency stems from the standard practice of optimizing policies directly in the high-dimensional and highly redundant parameter space $\Theta$. This challenge is greatly compounded in multi-task settings. In this work, we develop a novel, unsupervised approach that compresses the policy parameter space $\Theta$ into a low-dimensional latent space $\mathcal{Z}$. We train a generative model $g:\mathcal{Z}\to\Theta$ by optimizing a behavioral reconstruction loss, which ensures that the latent space is organized by functional similarity rather than proximity in parameterization. We conjecture that the inherent dimensionality of this manifold is a function of the environment's complexity, rather than the size of the policy network. We validate our approach in continuous control domains, showing that the parameterization of standard policy networks can be compressed up to five orders of magnitude while retaining most of its expressivity. As a byproduct, we show that the learned manifold enables task-specific adaptation via Policy Gradient operating in the latent space $\mathcal{Z}$.
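To make the distinction between "functional similarity" and "proximity in parameterization" concrete, the following minimal sketch compares two distances between policies. It is not the behavioral reconstruction loss used in the paper, whose exact form is not given in the abstract. It only illustrates one well-known source of redundancy in $\Theta$: permuting the hidden units of a network changes the parameter vector but not the function it computes. The tiny tanh network, the flat parameter layout, and the names `policy`, `permute_hidden_units`, `parameter_distance`, and `behavioral_distance` are all assumptions introduced for illustration.

```python
import math
import random


def policy(theta, state, hidden=4):
    """Tiny one-hidden-layer tanh policy mapping a scalar state to a scalar action.

    Hypothetical layout: theta = [w1 (hidden), b1 (hidden), w2 (hidden)].
    """
    w1 = theta[:hidden]
    b1 = theta[hidden:2 * hidden]
    w2 = theta[2 * hidden:3 * hidden]
    return sum(w2[i] * math.tanh(w1[i] * state + b1[i]) for i in range(hidden))


def permute_hidden_units(theta, perm, hidden=4):
    """Reorder the hidden units; the network still computes the same function."""
    w1 = theta[:hidden]
    b1 = theta[hidden:2 * hidden]
    w2 = theta[2 * hidden:3 * hidden]
    return ([w1[j] for j in perm]
            + [b1[j] for j in perm]
            + [w2[j] for j in perm])


def parameter_distance(a, b):
    """Euclidean distance between two flat parameter vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def behavioral_distance(a, b, states):
    """Mean squared difference between the two policies' actions on probe states."""
    return sum((policy(a, s) - policy(b, s)) ** 2 for s in states) / len(states)


rng = random.Random(0)
theta = [rng.uniform(-1.0, 1.0) for _ in range(12)]
theta_perm = permute_hidden_units(theta, perm=[3, 2, 1, 0])
states = [rng.uniform(-2.0, 2.0) for _ in range(100)]

param_dist = parameter_distance(theta, theta_perm)
behav_dist = behavioral_distance(theta, theta_perm, states)

print(f"parameter distance:  {param_dist:.4f}")
print(f"behavioral distance: {behav_dist:.2e}")
```

The two parameter vectors are far apart in $\Theta$, yet their behavioral distance is zero up to floating-point rounding. A latent space organized by a behavior-based criterion can therefore place such policies at the same point, which is the kind of redundancy a compression from $\Theta$ to $\mathcal{Z}$ can exploit.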

Paper Structure

This paper contains 10 sections, 11 equations, 28 figures, 3 tables.

Figures (28)

  • Figure 1: Autoencoder Spaces and Data Manifold.
  • Figure 2: Pipeline of Unsupervised Compression of the Policy Space.
  • Figure 3: Landscape of the Latent Behavior Manifold. Lighter and darker colors indicate higher and lower returns of the decoded policy, respectively. The plots shown here are a subset of the full results reported in the appendix. We consider a specific seed with different tasks (height, standard, speed), policy sizes (Small, Medium, Large), and encoding dimensions (1D, 2D, 3D), for both MC (first three columns, datasets of 50k policies) and RC (last column, datasets of 100k policies).
  • Figure 4: Performance comparison in MC for different tasks. We report the average and 95% confidence interval over 10 runs.
  • Figure 5: Performance comparison in RC for different tasks. We report the average and 95% confidence interval over 10 runs. For clarity, the worst-performing baselines are omitted. A full study is reported in the appendix.
  • ...and 23 more figures