Unified Policy Value Decomposition for Rapid Adaptation

Cristiano Capone, Luca Falorsi, Andrea Ciardiello, Luca Manneschi

Abstract

Rapid adaptation in complex control systems remains a central challenge in reinforcement learning. We introduce a framework in which policy and value functions share a low-dimensional coefficient vector - a goal embedding - that captures task identity and enables immediate adaptation to novel tasks without retraining representations. During pretraining, we jointly learn structured value bases and compatible policy bases through a bilinear actor-critic decomposition. The critic factorizes as $Q = \sum_k G_k(g)\, y_k(s,a)$, where the $G_k(g)$ are the components of a goal-conditioned coefficient vector and the $y_k(s,a)$ are learned value basis functions. This multiplicative gating - where a context signal scales a set of state-dependent bases - is reminiscent of gain modulation observed in Layer 5 pyramidal neurons, where top-down inputs modulate the gain of sensory-driven responses without altering their tuning. Building on Successor Features, we extend the decomposition to the actor, which composes a set of primitive policies weighted by the same coefficients $G_k(g)$. At test time the bases are frozen and $G_k(g)$ is estimated zero-shot via a single forward pass, enabling immediate adaptation to novel tasks without any gradient update. We train a Soft Actor-Critic agent on the MuJoCo Ant environment under a multi-directional locomotion objective, requiring the agent to walk in eight directions specified as continuous goal vectors. The bilinear structure allows each policy head to specialize to a subset of directions, while the shared coefficient layer generalizes across them, accommodating novel directions by interpolating in goal embedding space. Our results suggest that shared low-dimensional goal embeddings offer a general mechanism for rapid, structured adaptation in high-dimensional control, and highlight a potentially biologically plausible principle for efficient transfer in complex reinforcement learning systems.
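To make the bilinear decomposition concrete, the following is a minimal NumPy sketch of a critic of the form Q = sum_k G_k(g) y_k(s,a) and an actor that mixes K primitives with the same coefficients. It is an illustration under stated assumptions, not the authors' implementation: the layer sizes, the `tanh` nonlinearities, the random "pretrained" weights, the deterministic action (Soft Actor-Critic policies are stochastic), the weighted-sum rule for combining primitives, and the encoding of a direction as a unit vector are all hypothetical choices made here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed, not taken from the paper).
STATE_DIM, ACTION_DIM, GOAL_DIM, K = 27, 8, 2, 4

# Stand-ins for frozen, pretrained parameters; random here for illustration only.
W_gate = rng.normal(size=(K, GOAL_DIM))                 # goal -> coefficients G(g)
W_value = rng.normal(size=(K, STATE_DIM + ACTION_DIM))  # value bases y_k(s, a)
W_policy = rng.normal(size=(K, ACTION_DIM, STATE_DIM))  # policy primitives


def gate(g):
    """Goal embedding G(g): a single forward pass, shared by actor and critic."""
    return np.tanh(W_gate @ g)  # shape (K,)


def value_bases(s, a):
    """K value basis functions y_k(s, a); they do not depend on the goal."""
    return np.tanh(W_value @ np.concatenate([s, a]))  # shape (K,)


def critic(s, a, g):
    """Bilinear critic: Q = sum_k G_k(g) * y_k(s, a)."""
    return float(gate(g) @ value_bases(s, a))


def policy_primitives(s):
    """K primitive action proposals, one per policy head."""
    return np.tanh(W_policy @ s)  # shape (K, ACTION_DIM)


def actor(s, g):
    """Actor: primitives combined with the same coefficients G_k(g)."""
    return gate(g) @ policy_primitives(s)  # shape (ACTION_DIM,)


def direction_goal(theta):
    """Hypothetical goal encoding: a unit vector pointing at angle theta."""
    return np.array([np.cos(theta), np.sin(theta)])


# Zero-shot use: a new direction only changes G(g); the bases stay frozen.
s = rng.normal(size=STATE_DIM)
g_new = direction_goal(np.pi / 8)  # between two of eight evenly spaced directions
a = actor(s, g_new)
q = critic(s, a, g_new)
```

In this sketch the goal enters only through `gate`, so switching to a new direction requires recomputing K coefficients and nothing else, which is the property the abstract relies on for adaptation without gradient updates.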

Paper Structure (13 sections, 20 equations, 3 figures)

This paper contains 13 sections, 20 equations, 3 figures.

Figures (3)

  • Figure 1: MLP-based bilinear actor--critic architecture. A. Scheme of our architecture: the actor and critic are decomposed into $K$ parallel basis modules, policy primitives $Y_k(s)$ and value components $\phi_k(s,a)$. B--C. Comparison between the learning curves of a traditional architecture (2-layer MLP) and a single-layer MLP with bilinear decomposition. (Inset) Scheme of the navigation task: a robot with 8 DOF is asked to move in a specific direction. D. Comparison of learning curves between the cases in which the latent space $G_k$ is independent or shared between actor and critic. E--F. Direction encoding for actor and critic. G--I. Reward, correlation between actor and critic $G$, and direction encoding in the $G$ space, as functions of training steps.
  • Figure 2: Zero-shot learning. A. Task scheme: the MuJoCo Ant agent is pre-trained on target directions and tested on new ones. The pretrained bilinear agent is evaluated on unseen goal directions (or task descriptors) without any parameter update, by conditioning on $g$. B. Performance compared against baselines when switching to novel directions. C--D. Behavior trajectories for training and test directions, respectively, illustrating successful generalization to intermediate angles not explicitly trained. E. Direct comparison between train and test directions (averaged over trials).
  • Figure 3: Bilinear decomposition allows for interpretability and generalization. A. Manipulating individual gating coordinates $G_k$ produces consistent, semantically meaningful changes in behavior, indicating interpretable control axes. (Top) Average movement direction as a function of latent-space direction for three different amplitudes; (bottom) speed distribution for the same three amplitudes (color-coded). Notably, speed modulation emerges spontaneously, although training objectives were defined only over movement direction. B. Online adaptation of $G_k$, allowing real-time solution of the current task. Reward as a function of the target direction (blue). As a reference, a task with negative reward is shown (orange).