Policy Gradient Guidance Enables Test Time Control

Jianing Qi, Hao Tang, Zhigang Zhu

TL;DR

This work introduces Policy Gradient Guidance (PGG), the first extension of classifier-free guidance to classical on-policy reinforcement learning. By augmenting PPO with an unconditional action prior and a $\gamma$-weighted interpolation between conditional and unconditional updates, PGG provides a straightforward test-time control knob without retraining. The authors prove that the normalization term in the gradient cancels under advantage estimation, yielding a clean guided policy gradient update, and validate the approach on discrete and continuous control benchmarks. Empirically, conditioning dropout helps in simple discrete tasks but harms continuous control, while training with modestly larger guidance $\gamma>1$ yields improved stability, sample efficiency, and controllability, signaling a bridge between diffusion guidance and standard online RL methods. Overall, PGG demonstrates that guidance mechanisms from diffusion models can generalize to traditional policy gradient methods, opening new directions for controllable online reinforcement learning.

Abstract

We introduce Policy Gradient Guidance (PGG), a simple extension of classifier-free guidance from diffusion models to classical policy gradient methods. PGG augments the policy gradient with an unconditional branch and interpolates conditional and unconditional branches, yielding a test-time control knob that modulates behavior without retraining. We provide a theoretical derivation showing that the additional normalization term vanishes under advantage estimation, leading to a clean guided policy gradient update. Empirically, we evaluate PGG on discrete and continuous control benchmarks. We find that conditioning dropout, which is central to diffusion guidance, offers gains in simple discrete tasks and low sample regimes, but dropout destabilizes continuous control. Training with modestly larger guidance ($\gamma>1$) consistently improves stability, sample efficiency, and controllability. Our results show that guidance, previously confined to diffusion policies, can be adapted to standard on-policy methods, opening new directions for controllable online reinforcement learning.
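To make the test-time control knob concrete, here is a minimal sketch for a discrete action space. It assumes the guided policy follows the usual classifier-free guidance form, $\log \pi_\gamma(a \mid s) = \gamma \log \pi(a \mid s) + (1-\gamma) \log \pi(a) - \log Z(s)$, which the abstract implies but does not spell out. The function names and the example logits are hypothetical and are not taken from the paper.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def guided_log_probs(cond_logits, uncond_logits, gamma):
    """Assumed CFG-style interpolation of the two branches in log space.

    gamma = 1 recovers the conditional policy, gamma = 0 the unconditional
    prior, and gamma > 1 extrapolates away from the prior.
    """
    log_c = log_softmax(cond_logits)    # conditional branch, log pi(a | s)
    log_u = log_softmax(uncond_logits)  # unconditional branch, log pi(a)
    mixed = [gamma * c + (1.0 - gamma) * u for c, u in zip(log_c, log_u)]
    return log_softmax(mixed)           # renormalize, i.e. subtract log Z(s)

# Hypothetical logits for a three-action problem.
cond = [2.0, 0.5, -1.0]
uncond = [0.3, 0.2, 0.1]

for gamma in (0.0, 1.0, 2.0):
    probs = [math.exp(lp) for lp in guided_log_probs(cond, uncond, gamma)]
    print(gamma, [round(p, 3) for p in probs])
```

Because $\gamma$ enters only at sampling time in this sketch, the same pair of trained branches can be swept over several guidance strengths without retraining, which is the property the abstract highlights.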

Paper Structure

This paper contains 35 sections, 20 equations, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: Discrete PPO with Dropout 10%. Reward is plotted against training steps, with the guidance strength increased from $\gamma=1$ to $\gamma=20$ during inference.
  • Figure 2: Continuous PPO with Dropout 10%. The $\gamma$ value increases from blue to green. Dropout gives mixed results, and increasing $\gamma$ might hurt performance.
  • Figure 3: Discrete PPO with training $\gamma=1.1$
  • Figure 4: Continuous PPO with training $\gamma=1.1$. The $\gamma$ value increases from blue to green. Increasing $\gamma$ improves performance in most cases until $\gamma=1.3$, at which point there is a sharp drop in the more complex cases. In all cases, an optimal $\gamma$ can outperform PPO.