Mixed Policy Gradient: off-policy reinforcement learning driven jointly by data and model

Yang Guan, Jingliang Duan, Shengbo Eben Li, Jie Li, Jianyu Chen, Bo Cheng

TL;DR

This paper addresses the trade-off between data-driven and model-driven policy gradients (PGs) in reinforcement learning by introducing Mixed Policy Gradient (MPG), a weighted combination of both sources that aims for fast convergence without sacrificing asymptotic performance. It derives a unified PG framework, analyzes the biases caused by model errors and value estimation errors, and proposes a rule-based adaptive weighting scheme that shifts emphasis from the model-driven component to the data-driven component as learning progresses. Two MPG variants that differ in value learning are introduced (MPG-v1 with n-step TD and MPG-v2 with clipped double Q), along with an asynchronous architecture to boost update throughput. Empirical results on path tracking and inverted pendulum tasks show that MPG outperforms the baseline algorithms in both convergence speed and final performance, while the asynchronous design keeps training time practical.

Abstract

Reinforcement learning (RL) shows great potential in sequential decision-making. At present, mainstream RL algorithms are data-driven; they usually yield better asymptotic performance but converge much more slowly than model-driven methods. This paper proposes the mixed policy gradient (MPG) algorithm, which fuses empirical data and the transition model in the policy gradient (PG) to accelerate convergence without performance degradation. Formally, MPG is constructed as a weighted average of the data-driven and model-driven PGs, where the former is the derivative of the learned Q-value function and the latter is the derivative of the model-predictive return. To guide the weight design, we analyze and compare the upper bound of each PG error. Based on that analysis, a rule-based method is employed to adjust the weights heuristically. In particular, to obtain a better PG, the weight of the data-driven PG is designed to grow over the course of learning while the weight of the model-driven PG decreases. Simulation results show that the MPG method achieves the best asymptotic performance and convergence speed among the compared baseline algorithms.
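The weighted average described in the abstract can be illustrated with a minimal sketch. The linear schedule below is a hypothetical placeholder, not the paper's rule-based weighting, which is not given in this summary. It only reproduces the stated trend: the data-driven weight grows and the model-driven weight shrinks. The names `mixing_weights` and `mixed_policy_gradient` are assumptions introduced for illustration.

```python
def mixing_weights(iteration, total_iterations):
    """Return (w_data, w_model) for the current iteration.

    Hypothetical linear ramp (an assumption, not the paper's rule):
    w_data grows from 0 to 1, w_model shrinks from 1 to 0, and the
    two weights always sum to 1.
    """
    progress = min(max(iteration / total_iterations, 0.0), 1.0)
    return progress, 1.0 - progress


def mixed_policy_gradient(grad_data, grad_model, iteration, total_iterations):
    """Weighted average of a data-driven and a model-driven gradient.

    Both gradients are flat lists of floats with the same length.
    """
    w_data, w_model = mixing_weights(iteration, total_iterations)
    return [w_data * gd + w_model * gm for gd, gm in zip(grad_data, grad_model)]


if __name__ == "__main__":
    g_data = [1.0, 2.0]   # stand-in for the gradient from the learned Q-value
    g_model = [3.0, 4.0]  # stand-in for the gradient from the model-predictive return
    for k in (0, 50, 100):
        print(k, mixed_policy_gradient(g_data, g_model, k, 100))
```

Early in training the sketch follows the model-driven gradient almost entirely, and late in training it follows the data-driven one, which matches the direction of the weight shift stated above.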

Paper Structure

This paper contains 25 sections, 4 theorems, 41 equations, 6 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

The data-driven PG has the following relation with the unified PG: and if $f=p$ and $\rho^{\pi_{\theta}}$ is the stationary distribution, the model-driven PG and the unified PG have the following equivalence:

Figures (6)

  • Figure 1: Interpretation of MPG. The loss of the data-driven PG converges to the true loss as training proceeds but has large variance in the early stage. The loss of the model-driven PG has low variance but is biased. Mixed PG dynamically adjusts the weights of the data loss and the model loss to construct a better approximation of the true loss function.
  • Figure 2: Asynchronous learning architecture. Buffers, Actors and Learners are all distributed across multiple processes to improve the efficiency of replay, sampling, and PG computation. The time consumption per iteration of MPG can be significantly reduced by employing more Learners.
  • Figure 3: Tasks. (a) Path tracking task: $(\mathcal{S},\mathcal{A})\subset\mathbb{R}^6\times\mathbb{R}^2$. (b) Inverted pendulum task: $(\mathcal{S},\mathcal{A})\subset\mathbb{R}^4\times\mathbb{R}^1$.
  • Figure 4: Algorithm comparison in terms of asymptotic performance and convergence speed on the path tracking task. (a) Training curves. The dashed line shows the minimum requirement for the task to work, which is -30 in the task. (b) Convergence speed of different algorithms. The x-coordinate is different goal performance, i.e., episode return, and the y-coordinate is the iteration number needed to reach the goal. The missing part of the curves on some goal performances means that the algorithms never reached these goals during the training process. The solid lines correspond to the mean and the shaded regions correspond to 95% confidence interval over 5 runs.
  • Figure 5: Algorithm comparison in terms of asymptotic performance and convergence speed on the inverted pendulum task. (a) Training curves. The dashed line shows the minimum requirement for the task to work, which is -2 in the task. (b) Convergence speed of different algorithms. The MPG-v1 and $n$-step DPG fail on this task because of the $n$-step TD value learning, so they are not plotted. The solid lines correspond to the mean and the shaded regions correspond to 95% confidence interval over 5 runs.
  • ...and 1 more figure
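The convergence-speed metric in Figures 4(b) and 5(b), the number of iterations needed to reach a goal episode return, can be sketched as follows. This is an illustrative reading of the captions, not the authors' evaluation code. The function name `iterations_to_reach`, the sample returns, and the assumption that one return is logged per iteration are all hypothetical.

```python
def iterations_to_reach(returns, goal):
    """Return the first iteration index whose episode return meets the goal.

    `returns` holds one evaluated episode return per iteration (an
    assumption). Returns None if the goal is never reached, which
    corresponds to a missing part of a curve in the figures.
    """
    for iteration, episode_return in enumerate(returns):
        if episode_return >= goal:
            return iteration
    return None


if __name__ == "__main__":
    # Made-up training curve, for illustration only.
    curve = [-100.0, -50.0, -25.0, -10.0]
    for goal in (-30.0, -10.0, 0.0):
        print(goal, iterations_to_reach(curve, goal))
```

Sweeping `goal` over a range of returns and plotting the result against the goal gives one curve of the kind the captions describe.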

Theorems & Definitions (8)

  • Theorem 1
  • proof
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Theorem 2
  • proof