Revisiting Generative Policies: A Simpler Reinforcement Learning Algorithmic Perspective
Jinouwen Zhang, Rongkun Xue, Yazhe Niu, Yun Chen, Jing Yang, Hongsheng Li, Yu Liu
TL;DR
This work addresses the lack of a unified, RL-native approach to training generative policies for continuous-action tasks. It introduces two simple training schemes, GMPO and GMPG, that work with both diffusion and flow models and are integrated into a standardized framework, GenerativeRL. In extensive offline-RL experiments on the D4RL and RL Unplugged benchmarks, GMPO and GMPG achieve competitive or state-of-the-art performance while offering stable training and explicit policy extraction. The results suggest that decoupling generative models from RL components, and training them with advantage-weighted regression or RL-native policy gradients, can yield practical, scalable generative policies for robotics and control tasks.
Abstract
Generative models, particularly diffusion models, have achieved remarkable success in density estimation for multimodal data, drawing significant interest from the reinforcement learning (RL) community, especially for policy modeling in continuous action spaces. However, existing works vary significantly in their training schemes and RL optimization objectives, and some methods are applicable only to diffusion models. In this study, we compare and analyze various generative policy training and deployment techniques, identifying and validating effective designs for generative policy algorithms. Specifically, we revisit existing training objectives and classify them into two categories, each linked to a simpler approach. The first approach, Generative Model Policy Optimization (GMPO), employs a native advantage-weighted regression formulation as the training objective, which is significantly simpler than previous methods. The second approach, Generative Model Policy Gradient (GMPG), offers a numerically stable implementation of the native policy gradient method. We introduce a standardized experimental framework named GenerativeRL. Our experiments demonstrate that the proposed methods achieve state-of-the-art performance on various offline-RL datasets, offering a unified and practical guideline for training and deploying generative policies.
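To make the idea of an advantage-weighted regression objective for a generative policy concrete, the following is a minimal NumPy sketch, not the GenerativeRL implementation. It assumes a flow-model policy trained with a conditional flow-matching loss on a straight noise-to-action path, and it reweights each dataset sample by the exponentiated advantage exp(beta * (Q - V)). The function names, the `beta` temperature, the weight clipping, and the choice of path are all illustrative assumptions rather than details taken from the paper.

```python
import numpy as np


def awr_weights(q_values, v_values, beta=1.0, max_weight=100.0):
    """Exponentiated-advantage weights exp(beta * (Q - V)).

    Clipping at max_weight is an assumed stabilizer, not a detail from the paper.
    """
    advantages = np.asarray(q_values, dtype=float) - np.asarray(v_values, dtype=float)
    return np.minimum(np.exp(beta * advantages), max_weight)


def weighted_flow_matching_loss(velocity_fn, states, actions, weights, rng):
    """Advantage-weighted flow-matching loss on a linear noise-to-action path.

    velocity_fn(x_t, t, states) is a hypothetical model that predicts the
    velocity field of the policy's flow, conditioned on the state.
    """
    noise = rng.standard_normal(actions.shape)
    t = rng.uniform(size=(actions.shape[0], 1))
    x_t = (1.0 - t) * noise + t * actions  # point on the straight path at time t
    target = actions - noise               # constant velocity of that path
    pred = velocity_fn(x_t, t, states)
    per_sample = np.sum((pred - target) ** 2, axis=1)
    # Samples with higher advantage contribute more to the regression loss.
    return float(np.mean(weights * per_sample))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    states = rng.standard_normal((8, 3))
    actions = rng.standard_normal((8, 2))
    q = rng.standard_normal(8)
    v = np.zeros(8)

    # Placeholder model that always predicts zero velocity.
    zero_velocity = lambda x_t, t, s: np.zeros_like(x_t)

    w = awr_weights(q, v, beta=1.0)
    print(weighted_flow_matching_loss(zero_velocity, states, actions, w, rng))
```

In this sketch the generative model's own training loss is left unchanged and only the per-sample weighting carries the RL signal, which is one way to read the decoupling of generative models from RL components described above. In practice `velocity_fn` would be a trainable network and the loss would be minimized with an autodiff library.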
