Mixed Policy Gradient: off-policy reinforcement learning driven jointly by data and model
Yang Guan, Jingliang Duan, Shengbo Eben Li, Jie Li, Jianyu Chen, Bo Cheng
TL;DR
This paper addresses the trade-off between data-driven and model-driven policy gradients (PGs) in reinforcement learning by introducing Mixed Policy Gradient (MPG), a weighted combination of the two that aims to achieve fast convergence without sacrificing asymptotic performance. It derives a unified PG framework, analyzes the biases that arise from model errors and value estimation, and proposes a rule-based adaptive weighting scheme that shifts emphasis from the model-driven to the data-driven component as learning progresses. Two MPG variants that differ in value learning (MPG-v1 with n-step TD and MPG-v2 with clipped double Q) are introduced, along with an asynchronous architecture to boost update throughput. Empirical results on path tracking and inverted pendulum tasks show that MPG outperforms the baselines in both convergence speed and final performance, while the asynchronous design keeps training efficient in practice.
Abstract
Reinforcement learning (RL) shows great potential in sequential decision-making. At present, mainstream RL algorithms are data-driven; they usually yield better asymptotic performance but converge much more slowly than model-driven methods. This paper proposes the mixed policy gradient (MPG) algorithm, which fuses empirical data and the transition model in the policy gradient (PG) to accelerate convergence without degrading performance. Formally, MPG is constructed as a weighted average of the data-driven and model-driven PGs, where the former is the derivative of the learned Q-value function and the latter is the derivative of the model-predictive return. To guide the weight design, we analyze and compare the upper bounds of the two PG errors. Based on this analysis, a rule-based method is employed to adjust the weights heuristically. In particular, to obtain a better PG, the weight of the data-driven PG is designed to grow over the course of learning while the weight of the model-driven PG decreases. Simulation results show that MPG achieves the best asymptotic performance and the fastest convergence among the compared baseline algorithms.
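The weighted-average construction described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names `data_weight` and `mixed_policy_gradient` are hypothetical, the gradients are plain lists of floats, and the linear schedule merely stands in for the paper's rule-based weighting, which the abstract does not specify. The only properties taken from the abstract are that the two weights form a weighted average and that the data-driven weight grows during learning while the model-driven weight shrinks.

```python
from typing import List, Sequence


def data_weight(iteration: int, total_iterations: int) -> float:
    """Hypothetical schedule: weight of the data-driven PG.

    Grows linearly from 0 to 1 over training and is clamped to [0, 1].
    This is a stand-in, not the rule used in the paper.
    """
    if total_iterations <= 0:
        raise ValueError("total_iterations must be positive")
    return min(max(iteration / total_iterations, 0.0), 1.0)


def mixed_policy_gradient(
    g_data: Sequence[float],
    g_model: Sequence[float],
    w_data: float,
) -> List[float]:
    """Weighted average of a data-driven and a model-driven gradient.

    g_data  : gradient estimate derived from the learned Q-value function
    g_model : gradient estimate derived from the model-predictive return
    w_data  : weight of the data-driven term; the model-driven term
              receives 1 - w_data so the two weights sum to one
    """
    if len(g_data) != len(g_model):
        raise ValueError("gradients must have the same length")
    if not 0.0 <= w_data <= 1.0:
        raise ValueError("w_data must lie in [0, 1]")
    w_model = 1.0 - w_data
    return [w_data * gd + w_model * gm for gd, gm in zip(g_data, g_model)]


if __name__ == "__main__":
    # Toy gradients, chosen only to make the weighting visible.
    g_data = [1.0, 0.0]
    g_model = [0.0, 1.0]
    for it in (0, 50, 100):
        w = data_weight(it, total_iterations=100)
        print(it, w, mixed_policy_gradient(g_data, g_model, w))
```

Early in training the mixed gradient in this sketch follows the model-driven estimate, and late in training it follows the data-driven estimate, which mirrors the direction of the weight adjustment stated in the abstract.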
