Revisiting Generative Policies: A Simpler Reinforcement Learning Algorithmic Perspective

Jinouwen Zhang, Rongkun Xue, Yazhe Niu, Yun Chen, Jing Yang, Hongsheng Li, Yu Liu

TL;DR

This work tackles the lack of a unified, RL-native approach to training generative policies for continuous-action tasks. It introduces two simple training schemes, GMPO and GMPG, that work with both diffusion and flow models and are integrated into a standardized GenerativeRL framework. Through extensive offline-RL experiments on D4RL and RL Unplugged benchmarks, GMPO and GMPG achieve competitive or state-of-the-art performance while offering stable training and explicit policy extraction. The results demonstrate that decoupling generative models from RL components and using advantage-weighted regression or RL-native policy gradients can yield practical, scalable generative policies for real-world robotics and control tasks.

Abstract

Generative models, particularly diffusion models, have achieved remarkable success in density estimation for multimodal data, drawing significant interest from the reinforcement learning (RL) community, especially in policy modeling in continuous action spaces. However, existing works exhibit significant variations in training schemes and RL optimization objectives, and some methods are only applicable to diffusion models. In this study, we compare and analyze various generative policy training and deployment techniques, identifying and validating effective designs for generative policy algorithms. Specifically, we revisit existing training objectives and classify them into two categories, each linked to a simpler approach. The first approach, Generative Model Policy Optimization (GMPO), employs a native advantage-weighted regression formulation as the training objective, which is significantly simpler than previous methods. The second approach, Generative Model Policy Gradient (GMPG), offers a numerically stable implementation of the native policy gradient method. We introduce a standardized experimental framework named GenerativeRL. Our experiments demonstrate that the proposed methods achieve state-of-the-art performance on various offline-RL datasets, offering a unified and practical guideline for training and deploying generative policies.
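
To make the advantage-weighted regression idea behind GMPO concrete, the following is a minimal sketch, not the GenerativeRL implementation. It assumes the per-sample generative loss (for example, a score-matching or flow-matching loss) has already been computed, and it scales each sample by an exponential advantage weight. The function names, the temperature `beta`, and the weight clipping are illustrative assumptions that do not come from the paper.

```python
import math

import numpy as np


def awr_weights(q_values, v_values, beta=1.0, max_weight=100.0):
    """Exponential advantage weights exp(beta * (Q - V)), capped at max_weight.

    The cap is an assumption added here for numerical stability.
    """
    advantages = np.asarray(q_values, dtype=float) - np.asarray(v_values, dtype=float)
    # Clip the exponent rather than the result to avoid overflow in exp.
    exponent = np.minimum(beta * advantages, math.log(max_weight))
    return np.exp(exponent)


def weighted_generative_loss(per_sample_loss, weights):
    """Mean of per-sample generative-model losses, each scaled by its weight."""
    per_sample_loss = np.asarray(per_sample_loss, dtype=float)
    return float(np.mean(weights * per_sample_loss))


# Hypothetical batch of three (state, action) pairs.
q = np.array([1.0, 0.5, 2.0])      # critic estimates Q(s, a)
v = np.array([1.0, 1.0, 1.0])      # value estimates V(s)
loss = np.array([0.2, 0.4, 0.1])   # per-sample generative loss
w = awr_weights(q, v, beta=1.0)
total = weighted_generative_loss(loss, w)
```

Samples with positive advantage receive weights above 1 and therefore contribute more to the regression target, while samples with negative advantage are down-weighted.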

Paper Structure

This paper contains 56 sections, 45 equations, 5 figures, 14 tables, 3 algorithms.

Figures (5)

  • Figure 1: Log-Likelihood of D4RL datasets evaluated by GMPO/GMPG-GVP policies during training. HM stands for hopper-medium-v2, HME stands for halfcheetah-medium-expert-v2. Each point represents a model during training, with colors indicating different stages. The returns of the model are evaluated and averaged over five random seeds. Blue points denote the pretraining stage for GMPG and the training stage for GMPO, as GMPO does not require pretraining. Orange points indicate the finetuning stage for GMPG. The star marker shows the optimal model obtained during training. The density of the points reflects the number of models in that area.
  • Figure 2: 2D toy Swiss Roll dataset with an assigned value function. Values range from $-3.5$ to $1.5$ as the spiral extends outward. Colors represent data point values. A small amount of noise ($\epsilon=0.6$) is added for better visualization.
  • Figure 3: Generation trajectories of models trained by GMPO and GMPG on the 2D toy Swiss Roll dataset. Colors indicate time stamps of data points during generation.
  • Figure 4: An example of using GenerativeRL for defining models, training, and sampling. All experiment configurations are organized in a nested dictionary and can be recorded for reproducibility. The configuration of every component is modular and can be easily switched. Diverse generative models and neural network components are supported. Users can switch between training objectives and inference strategies easily. Most functions support batch processing and automatic differentiation with only a few lines of code. This configuration is flexible and can be easily extended to different RL tasks.
  • Figure 5: Framework structure of GenerativeRL.
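
The toy dataset described in Figure 2 can be approximated with a short sketch. This is an illustrative reconstruction, not the authors' data-generation code: the spiral parameterization follows the common Swiss Roll convention, the value is assumed to grow linearly along the spiral from $-3.5$ to $1.5$, and the noise is assumed to be Gaussian with scale $0.6$ added to the 2D points. The function name and all parameters are hypothetical.

```python
import numpy as np


def make_swiss_roll_with_values(n=1000, noise=0.6, v_min=-3.5, v_max=1.5, seed=0):
    """Sample a noisy 2D Swiss Roll and assign each point a scalar value.

    Assumption: the value increases linearly with the spiral angle, so
    points farther out along the spiral receive higher values.
    """
    rng = np.random.default_rng(seed)
    # Spiral angle; the radius equals the angle, so larger t lies farther out.
    t = 1.5 * np.pi * (1.0 + 2.0 * rng.random(n))
    points = np.stack([t * np.cos(t), t * np.sin(t)], axis=1)
    points += noise * rng.standard_normal(points.shape)
    # Map the angle linearly onto [v_min, v_max].
    frac = (t - t.min()) / (t.max() - t.min())
    values = v_min + (v_max - v_min) * frac
    return points, values


points, values = make_swiss_roll_with_values()
```

Pairing each point with its value gives a small dataset on which value-weighted training objectives can be compared visually.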