Massively Scaling Explicit Policy-conditioned Value Functions
Nico Bohlinger, Jan Peters
TL;DR
The paper addresses the challenge of scaling Explicit Policy-conditioned Value Functions (EPVFs) for continuous-control tasks, where naive scaling can cause uncontrolled growth of the policy parameters and inefficient exploration. It introduces a scaling strategy that combines massive GPU-based parallelization, large batch sizes, weight clipping, and scaled perturbations to train EPVFs effectively. Empirical results show that EPVFs can reach performance competitive with state-of-the-art DRL baselines like PPO and SAC on tasks such as Cartpole and a custom Ant environment. Input representations matter: raw policy parameters generally outperform action-based probes, though weight-space architectures can match or exceed baselines. The work demonstrates the viability of offline/off-policy EPVF learning at scale and highlights the design choices, especially weight clipping and parameter-space exploration, that stabilize training and improve exploration. It also points to weight-space representations as a direction for further improving scaling.
Abstract
We introduce a scaling strategy for Explicit Policy-Conditioned Value Functions (EPVFs) that significantly improves performance on challenging continuous-control tasks. EPVFs learn a value function V(θ) that is explicitly conditioned on the policy parameters, enabling direct gradient-based updates to the parameters of any policy. However, EPVFs at scale suffer from unrestricted parameter growth and inefficient exploration in the policy parameter space. To address these issues, we utilize massive parallelization with GPU-based simulators, large batch sizes, weight clipping, and scaled perturbations. Our results show that EPVFs can be scaled to solve complex tasks, such as a custom Ant environment, and can compete with state-of-the-art Deep Reinforcement Learning (DRL) baselines like Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC). We further explore action-based policy parameter representations from previous work and specialized neural network architectures to efficiently handle weight-space features, which have not been used in the context of DRL before.
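To make the interplay of the three ingredients concrete, the following is a minimal sketch, not the authors' implementation, of a gradient-ascent step on V(θ) combined with weight clipping and perturbation-based exploration. Everything specific here is an assumption for illustration: the learned value network is replaced by a toy quadratic `toy_value` with an analytic gradient, the perturbation noise is taken to be Gaussian with a standard deviation proportional to the clipping bound, and the constants `CLIP`, `NOISE_SCALE`, and `LR` are hypothetical.

```python
import random

CLIP = 1.0         # hypothetical clipping bound for every policy weight
NOISE_SCALE = 0.1  # hypothetical perturbation scale, relative to CLIP
LR = 0.05          # hypothetical step size for the policy update


def clip_weights(theta, limit=CLIP):
    """Clamp each policy parameter to [-limit, limit] to prevent unbounded growth."""
    return [max(-limit, min(limit, w)) for w in theta]


def perturb(theta, rng, scale=NOISE_SCALE, limit=CLIP):
    """Explore in parameter space: add Gaussian noise (std = scale * limit), then clip."""
    noisy = [w + rng.gauss(0.0, scale * limit) for w in theta]
    return clip_weights(noisy, limit)


def toy_value(theta, target):
    """Stand-in for a learned V(theta): a concave quadratic peaked at `target`."""
    return -sum((w - t) ** 2 for w, t in zip(theta, target))


def toy_value_grad(theta, target):
    """Analytic gradient of toy_value with respect to theta."""
    return [-2.0 * (w - t) for w, t in zip(theta, target)]


def ascent_step(theta, target, lr=LR):
    """One gradient-ascent step on V(theta), followed by weight clipping."""
    grad = toy_value_grad(theta, target)
    return clip_weights([w + lr * g for w, g in zip(theta, grad)])


rng = random.Random(0)
target = [0.5, -2.0, 0.25]  # the second optimum lies outside the clipping range
theta = [0.0, 0.0, 0.0]
for _ in range(200):
    theta = ascent_step(theta, target)

# Perturbed copies of the current policy, e.g. one per parallel environment.
candidates = [perturb(theta, rng) for _ in range(8)]
```

In this toy setting, the first and third parameters converge to their unconstrained optima, while the second is held at the clipping bound, which shows how clipping caps parameter growth even when the value gradient keeps pushing outward. In an actual EPVF setup, the gradient would come from differentiating a trained value network with respect to its policy-parameter input, and the perturbed candidates would be evaluated in parallel simulation to collect new training data.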
