Policy Gradient Guidance Enables Test Time Control

Jianing Qi; Hao Tang; Zhigang Zhu

TL;DR

This work introduces Policy Gradient Guidance (PGG), which the authors present as the first extension of classifier-free guidance to classical on-policy reinforcement learning. PGG augments PPO with an unconditional action prior and a $\gamma$-weighted interpolation between conditional and unconditional updates, giving a test-time control knob that requires no retraining. The authors prove that the normalization term in the gradient cancels under advantage estimation, yielding a clean guided policy gradient update, and validate the approach on discrete and continuous control benchmarks. Empirically, conditioning dropout helps in simple discrete tasks but harms continuous control, while training with modestly larger guidance ($\gamma>1$) improves stability, sample efficiency, and controllability. Overall, PGG shows that guidance mechanisms from diffusion models can carry over to traditional policy gradient methods, opening new directions for controllable online reinforcement learning.

Abstract

We introduce Policy Gradient Guidance (PGG), a simple extension of classifier-free guidance from diffusion models to classical policy gradient methods. PGG augments the policy gradient with an unconditional branch and interpolates conditional and unconditional branches, yielding a test-time control knob that modulates behavior without retraining. We provide a theoretical derivation showing that the additional normalization term vanishes under advantage estimation, leading to a clean guided policy gradient update. Empirically, we evaluate PGG on discrete and continuous control benchmarks. We find that conditioning dropout, which is central to diffusion guidance, offers gains in simple discrete tasks and low sample regimes, but dropout destabilizes continuous control. Training with modestly larger guidance ($\gamma>1$) consistently improves stability, sample efficiency, and controllability. Our results show that guidance, previously confined to diffusion policies, can be adapted to standard on-policy methods, opening new directions for controllable online reinforcement learning.
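
To make the test-time control knob concrete, the following is a minimal sketch of how a guided action distribution could be formed for a discrete action space. The abstract does not give the exact formula, so the interpolation used here, $\log \pi_\gamma \propto (1-\gamma)\log \pi_u + \gamma \log \pi_c$, is an assumption borrowed from the usual classifier-free guidance form. The function names `log_softmax` and `guided_probs` and the example logits are hypothetical and not taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def guided_probs(cond_logits, uncond_logits, gamma):
    """Blend conditional and unconditional log-probabilities, then renormalize.

    Assumed form (classifier-free-guidance style, not confirmed by the source):
        log pi_gamma(a) = (1 - gamma) * log pi_u(a) + gamma * log pi_c(a) - log Z
    gamma = 0 recovers the unconditional prior, gamma = 1 the conditional
    policy, and gamma > 1 sharpens the policy away from the prior.
    """
    log_c = log_softmax(cond_logits)
    log_u = log_softmax(uncond_logits)
    mixed = [(1.0 - gamma) * u + gamma * c for c, u in zip(log_c, log_u)]
    # The final log_softmax applies the normalization term log Z.
    return [math.exp(x) for x in log_softmax(mixed)]


# Hypothetical logits for a 3-action problem.
cond = [2.0, 0.5, -1.0]    # state-conditioned branch
uncond = [0.3, 0.2, 0.1]   # unconditional action prior

for gamma in (0.0, 1.0, 2.0):
    print(gamma, [round(p, 3) for p in guided_probs(cond, uncond, gamma)])
```

Under this assumed form, changing `gamma` at evaluation time only reweights the two branches that were already trained, which is why no retraining is needed to move between more prior-like and more sharply conditioned behavior.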
