GoRL: An Algorithm-Agnostic Framework for Online Reinforcement Learning with Generative Policies

Chubin Zhang, Zhenglin Wan, Feng Chen, Xingrui Yu, Ivor Tsang, Bo An

TL;DR

GoRL addresses the enduring tension between stable online optimization and expressive, multimodal action modeling in reinforcement learning. By decoupling optimization from generation, it learns a tractable latent policy $\pi_\theta(\varepsilon|s)$ and a separate expressive decoder $g_\phi(s,\varepsilon)$, and updates them with a two-timescale alternating scheme. The latent policy is optimized with standard policy gradients, while the decoder is refined with likelihood-free generative objectives, yielding stable training and richer action distributions. Empirically, GoRL outperforms Gaussian baselines and prior generative methods across six DMControl tasks, reaching a normalized return above 870 on HopperStand and showing clear evidence of emergent multimodality in actions. The work provides a practical, algorithm- and model-agnostic path to combining stability with expressiveness in online RL, and suggests directions for extending to off-policy and high-dimensional settings.

Abstract

Reinforcement learning (RL) faces a persistent tension: policies that are stable to optimize are often too simple to represent the multimodal action distributions needed for complex control. Gaussian policies provide tractable likelihoods and smooth gradients, but their unimodal form limits expressiveness. Conversely, generative policies based on diffusion or flow matching can model rich multimodal behaviors; however, in online RL, they are frequently unstable due to intractable likelihoods and noisy gradients propagating through deep sampling chains. We address this tension with a key structural principle: decoupling optimization from generation. Building on this insight, we introduce GoRL (Generative Online Reinforcement Learning), a framework that optimizes a tractable latent policy while utilizing a conditional generative decoder to synthesize actions. A two-timescale update schedule enables the latent policy to learn stably while the decoder steadily increases expressiveness, without requiring tractable action likelihoods. Across a range of continuous-control tasks, GoRL consistently outperforms both Gaussian policies and recent generative-policy baselines. Notably, on the HopperStand task, it reaches a normalized return above 870, more than 3 times that of the strongest baseline. These results demonstrate that separating optimization from generation provides a practical path to policies that are both stable and highly expressive.
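The decoupling described above can be sketched in a few lines of Python. This is a minimal toy illustration, not the authors' implementation: the scalar latent, the `tanh` decoder, the weight vector `w`, and the `phase_schedule` helper with its `policy_steps`/`decoder_steps` parameters are all hypothetical names and choices introduced here. The point it illustrates is that the log-probability is only ever evaluated for the latent $\varepsilon$ under the Gaussian $\pi_\theta$, never for the decoded action.

```python
import math
import random


def sample_latent(mu, log_std, rng):
    """Sample eps ~ N(mu, exp(log_std)^2) and return (eps, log pi_theta(eps | s))."""
    std = math.exp(log_std)
    eps = rng.gauss(mu, std)
    log_prob = (-0.5 * ((eps - mu) / std) ** 2
                - log_std
                - 0.5 * math.log(2.0 * math.pi))
    return eps, log_prob


def decoder(state, eps, w):
    """Toy stand-in for g_phi(s, eps): maps state and latent to a bounded action.

    Assumption: a real decoder would be a flow-matching or diffusion model;
    a tanh of an affine map is used here only to keep the sketch runnable.
    """
    return math.tanh(w[0] * eps + w[1] * state + w[2])


def phase_schedule(num_iters, policy_steps, decoder_steps):
    """Yield which component is trained at each iteration (hypothetical pattern).

    'policy'  : decoder frozen, latent policy updated with a policy gradient.
    'decoder' : latent policy frozen, decoder refined on recent rollouts.
    """
    cycle = policy_steps + decoder_steps
    for t in range(num_iters):
        yield "policy" if (t % cycle) < policy_steps else "decoder"


rng = random.Random(0)
eps, log_prob = sample_latent(mu=0.0, log_std=0.0, rng=rng)
action = decoder(state=0.5, eps=eps, w=[1.0, 0.2, 0.0])
phases = list(phase_schedule(num_iters=6, policy_steps=2, decoder_steps=1))
```

Giving the latent policy more iterations per cycle than the decoder is one way to realize a two-timescale schedule; the actual ratio used in the paper is not stated in this summary.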

Paper Structure

This paper contains 72 sections, 4 theorems, 42 equations, 7 figures, 3 tables, 1 algorithm.

Key Result

Lemma 3.1

Unbiased latent policy gradient. Fix $\phi$, and let $\pi_{\theta,\phi}(a\mid s)$ be the action distribution induced by sampling $\varepsilon\sim\pi_\theta(\cdot\mid s)$ and executing $a=g_\phi(s,\varepsilon)$. Then the gradient of the return of $\pi_{\theta,\phi}$ with respect to $\theta$ can be written as an expectation over latent samples, which matches Eq. (eq:latent-pg). Thus, latent-space gradients are unbiased estimators of the true policy gradient of $\pi_{\theta,\phi}$.
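One way to see why a result of this kind holds is to absorb the frozen decoder into the environment and apply the standard policy gradient theorem in the resulting latent-action MDP. The sketch below uses assumed notation that does not come from the paper: $\tilde{P}$, $\tilde{r}$, $\tilde{J}$ and $\tilde{Q}$ denote the transition kernel, reward, return and action-value function of the latent MDP, and $d^{\pi}$ is the discounted state-visitation distribution. The paper's own statement may use an advantage or a different normalization.

```latex
\begin{align}
\tilde{P}(s' \mid s, \varepsilon) &= P\big(s' \mid s, g_\phi(s,\varepsilon)\big),
\qquad
\tilde{r}(s, \varepsilon) = r\big(s, g_\phi(s,\varepsilon)\big), \\
J(\pi_{\theta,\phi}) &= \tilde{J}(\pi_\theta), \\
\nabla_\theta J(\pi_{\theta,\phi}) &\propto
\mathbb{E}_{s \sim d^{\pi_{\theta,\phi}},\; \varepsilon \sim \pi_\theta(\cdot \mid s)}
\Big[ \nabla_\theta \log \pi_\theta(\varepsilon \mid s)\,
\tilde{Q}^{\pi_\theta}(s, \varepsilon) \Big],
\qquad
\tilde{Q}^{\pi_\theta}(s, \varepsilon) = Q^{\pi_{\theta,\phi}}\big(s, g_\phi(s,\varepsilon)\big).
\end{align}
```

Only $\log \pi_\theta(\varepsilon \mid s)$ appears on the right-hand side, so the argument never needs the density of the decoded action $a$.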

Figures (7)

  • Figure 1: Illustration of the "mode-covering" problem: when the optimal policy (blue) is multimodal, a unimodal policy (red) assigns high probability to the low-reward region between modes, leading to suboptimal behavior.
  • Figure 2: Overview of the GoRL framework. (a) Latent optimization: The decoder $g_\phi$ is frozen while the encoder $\pi_\theta$ is optimized in the latent space using standard policy gradients, with a KL penalty toward $\mathcal{N}(0,I)$. (b) Decoder refinement: The encoder is frozen and the decoder $g_\phi$ is updated via supervised learning on recent rollouts, mapping the fixed Gaussian prior over $\varepsilon$ to actions using an expressive generative loss (e.g., flow matching).
  • Figure 3: Visual overview of the DMControl tasks. The benchmark covers diverse control challenges: high-speed locomotion (CheetahRun), bipedal gait control (WalkerWalk), object manipulation (FingerSpin, FingerTurnHard), and fine-grained stabilization with complex contacts (HopperStand, FishSwim).
  • Figure 4: Learning curves across six DMControl tasks. Curves are smoothed using Gaussian filtering ($\sigma = 100.0$) for visual clarity. Shaded regions denote standard deviation across five seeds.
  • Figure 5: Ablation of latent regularization on CheetahRun. Varying the KL coefficient $\beta$ significantly affects stability.
  • ...and 2 more figures
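The KL penalty toward $\mathcal{N}(0,I)$ mentioned for Figures 2 and 5 has a closed form when the latent policy is a diagonal Gaussian. The sketch below assumes that parameterization and assumes the penalty is $\mathrm{KL}(\pi_\theta \,\|\, \mathcal{N}(0,I))$ subtracted from the surrogate objective with weight $\beta$; the function names and the exact way the paper combines the two terms are assumptions made for illustration.

```python
import math


def kl_to_standard_normal(mu, log_std):
    """KL( N(mu, diag(exp(log_std)^2)) || N(0, I) ) in nats, for a diagonal Gaussian."""
    total = 0.0
    for m, ls in zip(mu, log_std):
        var = math.exp(2.0 * ls)
        # Per-dimension term: 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2)
        total += 0.5 * (var + m * m - 1.0 - 2.0 * ls)
    return total


def penalized_objective(surrogate, mu, log_std, beta):
    """Surrogate policy-gradient objective minus a beta-weighted KL penalty (maximized)."""
    return surrogate - beta * kl_to_standard_normal(mu, log_std)


# A latent policy equal to the prior incurs no penalty; shifting the mean does.
no_penalty = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
shifted = kl_to_standard_normal([1.0], [0.0])
```

A larger $\beta$ keeps the latent distribution close to the prior that the decoder is trained on, while a smaller $\beta$ lets the latent policy move further from it; Figure 5 reports that this trade-off significantly affects stability on CheetahRun.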

Theorems & Definitions (7)

  • Lemma 3.1
  • proof
  • Lemma 3.2
  • Lemma: Unbiased Latent Policy Gradient
  • proof
  • Lemma: Performance under Small Latent Divergence
  • proof