Massively Scaling Explicit Policy-conditioned Value Functions

Nico Bohlinger, Jan Peters

TL;DR

The paper addresses the challenge of scaling Explicit Policy-conditioned Value Functions (EPVFs) for continuous-control tasks, where naive scaling can cause uncontrolled growth in policy representations and exploration inefficiency. It introduces a scaling strategy that combines massive GPU-based parallelization, large batch sizes, weight clipping, and scaled perturbations to train EPVFs effectively. Empirical results show EPVFs can reach competitive performance with state-of-the-art DRL baselines like PPO and SAC on tasks such as Cartpole and a custom Ant environment; input representations matter, with raw policy parameters generally outperforming action-based probes, though weight-space architectures can match or exceed baselines. The work demonstrates the viability of offline/off-policy EPVF learning at scale and highlights design choices—especially weight clipping and parameter-space exploration—that stabilize training and improve exploration, while pointing to future exploration of weight-space representations to further enhance scaling.

Abstract

We introduce a scaling strategy for Explicit Policy-Conditioned Value Functions (EPVFs) that significantly improves performance on challenging continuous-control tasks. EPVFs learn a value function V(θ) that is explicitly conditioned on the policy parameters, enabling direct gradient-based updates to the parameters of any policy. However, EPVFs at scale struggle with unrestricted parameter growth and efficient exploration in the policy parameter space. To address these issues, we utilize massive parallelization with GPU-based simulators, large batch sizes, weight clipping, and scaled perturbations. Our results show that EPVFs can be scaled to solve complex tasks, such as a custom Ant environment, and can compete with state-of-the-art Deep Reinforcement Learning (DRL) baselines like Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC). We further explore action-based policy parameter representations from previous work and specialized neural network architectures to efficiently handle weight-space features, which have not been used in the context of DRL before.
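
The loop described in the abstract (perturb the policy parameters, fit V(θ) on the resulting returns, ascend the gradient of V(θ) with respect to θ, then clip the weights) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration and not the paper's implementation: `toy_return` is a hypothetical stand-in for simulator rollouts, the value function is a small quadratic model fitted by least squares rather than a neural network, and the dimensions, learning rate, and clip range are arbitrary.

```python
import numpy as np


def toy_return(theta, target):
    """Hypothetical stand-in for an episode return; a real setup would roll out the policy."""
    return -np.sum((theta - target) ** 2, axis=-1)


def features(theta):
    """Bias, linear, and squared terms of the policy parameters."""
    ones = np.ones(theta.shape[:-1] + (1,))
    return np.concatenate([ones, theta, theta ** 2], axis=-1)


def fit_value_function(thetas, returns):
    """Least-squares fit of V(theta) = features(theta) @ w on one batch."""
    w, *_ = np.linalg.lstsq(features(thetas), returns, rcond=None)
    return w


def value_gradient(theta, w):
    """Analytic gradient of the fitted V with respect to theta."""
    dim = theta.shape[-1]
    linear, quadratic = w[1:1 + dim], w[1 + dim:]
    return linear + 2.0 * quadratic * theta


def train(steps=200, batch_size=256, noise_scale=0.3, clip=1.0, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    target = np.array([0.5, -0.5, 0.25, 0.0])  # assumed optimum of the toy task
    theta = np.zeros_like(target)
    for _ in range(steps):
        # Explore in parameter space with scaled uniform perturbations.
        noise = rng.uniform(-noise_scale, noise_scale, size=(batch_size, theta.size))
        perturbed = theta + noise
        returns = toy_return(perturbed, target)
        # Fit V(theta) on the batch of (perturbed parameters, return) pairs.
        w = fit_value_function(perturbed, returns)
        # Gradient ascent on V with respect to the policy parameters.
        theta = theta + lr * value_gradient(theta, w)
        # Weight clipping keeps the parameters from growing without bound.
        theta = np.clip(theta, -clip, clip)
    return theta, target
```

Calling `train()` moves `theta` toward the assumed optimum of the toy task, while `train(clip=0.3)` shows the clip binding on the parameters whose optimum lies outside the allowed range.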

Paper Structure

This paper contains 4 sections, 2 equations, 2 figures, 1 algorithm.

Figures (2)

  • Figure 1: Performance on Cartpole with different numbers of parallel environments. The return is the average undiscounted return achieved with the perturbed policy parameters used during data collection.
  • Figure 2: Top left -- Performance on Ant with different numbers of parallel environments, which equal the batch size. Top right -- Ablation on the different algorithmic changes introduced to scale EPVFs. Every ablation uses the same parameters as the best setup except for one change. Multiple policies: Update a set of 4096 differently initialized policy parameters instead of just one. 100x buffer size: Increase the replay buffer size to 409600 to sample older data as well. Fixed policy LR: Use a fixed learning rate of $10^{-5}$ for the policy instead of the learning rate schedule. 1.0 noise scale: Use a uniform noise scale of $1.0$ for perturbing the policy parameters instead of $0.3$. No weight clip: Do not clip the policy parameters. Clip perturbations: Directly clip the perturbations of the policy parameters to $(-0.3, 0.3)$ instead of clipping only after the gradient steps. Gaussian noise: Use $\mathcal{N}(\mu = 0, \sigma = 1.0)$ instead of uniform noise for perturbing the policy parameters. Bottom -- Performance of the action-based policy parameter representations (Probing) and specialized weight-space architectures. Additionally, the performance of the PPO and SAC baselines is shown as dashed lines.
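
The one-change-per-run ablation scheme in the Figure 2 caption could be written down as a set of configuration overrides, sketched below. The `EPVFConfig` class and all field names are hypothetical and chosen only for illustration. The numeric values come from the caption, except the default buffer size of 4096, which is inferred from the "100x buffer size" value of 409600. The learning rate schedule and the weight clip range are not given in the caption, so they are represented only as flags.

```python
from dataclasses import asdict, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class EPVFConfig:
    """Hypothetical container for the settings named in the Figure 2 caption."""
    num_policies: int = 1                    # "instead of just one"
    buffer_size: int = 4096                  # inferred: 409600 / 100
    fixed_policy_lr: Optional[float] = None  # None means: use the learning rate schedule
    noise_kind: str = "uniform"
    noise_scale: float = 0.3
    clip_weights: bool = True
    clip_perturbations: bool = False


BEST = EPVFConfig()

# Each ablation copies the best setup and overrides the setting named in the caption.
ABLATIONS = {
    "Multiple policies": replace(BEST, num_policies=4096),
    "100x buffer size": replace(BEST, buffer_size=409600),
    "Fixed policy LR": replace(BEST, fixed_policy_lr=1e-5),
    "1.0 noise scale": replace(BEST, noise_scale=1.0),
    "No weight clip": replace(BEST, clip_weights=False),
    "Clip perturbations": replace(BEST, clip_perturbations=True),
    # The caption gives sigma = 1.0, so the scale changes together with the noise kind.
    "Gaussian noise": replace(BEST, noise_kind="gaussian", noise_scale=1.0),
}


def changed_fields(config, base=BEST):
    """Return the fields in which `config` differs from `base`."""
    base_values = asdict(base)
    return {k: v for k, v in asdict(config).items() if base_values[k] != v}
```

For example, `changed_fields(ABLATIONS["No weight clip"])` returns only the `clip_weights` entry, which makes it easy to check that a run differs from the best setup in the intended way.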