Analysis of On-policy Policy Gradient Methods under the Distribution Mismatch

Weizhen Wang, Jianping He, Xiaoming Duan

TL;DR

This paper investigates the distribution mismatch inherent in on-policy policy gradient methods for discounted RL. It develops a theoretical framework showing that, under tabular parameterizations, the bias from mismatch does not prevent global optimality, and extends the analysis to general parameterizations by deriving mismatch bounds that shrink as the discount factor $\gamma$ approaches 1. A finite-time convergence bound for biased policy gradient is established under mild assumptions, providing insight into why biased updates often perform robustly in practice. Numerical experiments on continuing and episodic tasks corroborate the theory, showing biased and unbiased PG converging to the same optimum and highlighting reduced bias as $\gamma$ grows. The results help bridge the gap between theoretical policy gradient guarantees and practical implementations that rely on biased gradient estimates.

Abstract

Policy gradient methods are among the most successful approaches to challenging reinforcement learning problems. However, despite their empirical success, many state-of-the-art policy gradient algorithms for discounted problems deviate from the theoretical policy gradient theorem because of a distribution mismatch. In this work, we analyze the impact of this mismatch on policy gradient methods. Specifically, we first show that, for tabular parameterizations, the methods under the mismatch remain globally optimal. Then, we extend this analysis to more general parameterizations by leveraging the theory of biased stochastic gradient descent. Our findings offer new insights into the robustness of policy gradient methods as well as the gap between theoretical foundations and practical implementations.
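To make the mismatch concrete, the following is a minimal sketch under stated assumptions, not the paper's algorithm. It assumes the mismatch shows up in a sampled REINFORCE-style estimator as a missing $\gamma^t$ weight on step $t$: the discounted policy gradient theorem weights states by discounted visitation, while many practical implementations weight all visited states equally. The function names `discounted_returns` and `pg_estimate` and the flag `correct_mismatch` are hypothetical and introduced only for illustration.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_t = sum_{k>=t} gamma^(k-t) * r_k for every step t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def pg_estimate(grad_log_probs, rewards, gamma, correct_mismatch):
    """Single-trajectory REINFORCE-style gradient estimate (illustrative only).

    grad_log_probs: array of shape (T, d); row t is grad_theta log pi(a_t | s_t).
    correct_mismatch=True keeps the gamma^t weight on step t (assumed here to
    match the discounted policy gradient theorem); False drops it, which is
    the biased variant assumed to correspond to common practice.
    """
    grad_log_probs = np.asarray(grad_log_probs, dtype=float)
    returns = discounted_returns(rewards, gamma)
    T = len(rewards)
    weights = gamma ** np.arange(T) if correct_mismatch else np.ones(T)
    # Weighted sum over time steps of G_t * grad log pi(a_t | s_t).
    return (weights * returns) @ grad_log_probs

# Toy trajectory: three steps, one-hot score vectors so each step is visible.
scores = np.eye(3)
rewards = [1.0, 1.0, 1.0]
unbiased = pg_estimate(scores, rewards, gamma=0.5, correct_mismatch=True)
biased = pg_estimate(scores, rewards, gamma=0.5, correct_mismatch=False)
```

In this toy trajectory the two estimates agree on the first step and differ increasingly on later steps; with `gamma=1.0` the weights coincide and so do the estimates.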

Paper Structure

This paper contains 19 sections, 7 theorems, 66 equations, 6 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

For the direct policy parameterization, let $J^*$ and $\pi_*$ denote the optimal objective value and an optimal policy, respectively. Then, for any policy $\pi$, a gradient domination inequality holds (the displayed inequality is missing from this excerpt). In that bound, $\kappa=\frac{1}{1-\gamma}$ in the continuing setting, while in episodic tasks $\kappa$ is a scale factor associated with $\pi_*$.

Figures (6)

  • Figure 1: Direct policy parameterization results for the Jack's car rental problem under different choices of $\gamma$.
  • Figure 2: Tabular softmax policy results for the Jack's car rental problem under different choices of $\gamma$.
  • Figure 3: Direct policy parameterization results for the gridworld problem under different choices of $\gamma$.
  • Figure 4: Tabular softmax policy results for the gridworld problem under different choices of $\gamma$.
  • Figure 5: Gridworld
  • ...and 1 more figure

Theorems & Definitions (9)

  • Theorem 1: Gradient domination
  • Theorem 2: Global convergence
  • Remark 1: Convergence under general parameterizations
  • Theorem 3: Mismatch bound for episodic MDPs
  • Theorem 4: Mismatch bound for continuing MDPs
  • Theorem 5: Convergence bound
  • Lemma 1: Performance difference lemma [SK-JL:2002]
  • Definition 1: Total variation distance
  • Theorem 6: Convergence theorem [DL-YP:2017]
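
The total variation distance and the continuing-MDP mismatch bound listed above can be illustrated numerically. The sketch below is an assumption-laden toy, not a reproduction of the paper's theorems: it assumes the mismatch is measured between the discounted state visitation distribution and the stationary distribution of a fixed Markov chain, and it uses the convention $\mathrm{TV}(p,q)=\frac{1}{2}\sum_s |p(s)-q(s)|$. The two-state transition matrix `P`, the initial distribution `rho`, and all function names are hypothetical.

```python
import numpy as np

def discounted_visitation(P, rho, gamma):
    """Discounted state visitation d = (1 - gamma) * rho^T (I - gamma P)^{-1}."""
    n = P.shape[0]
    # Solve the transposed system (I - gamma P)^T d = (1 - gamma) rho.
    return np.linalg.solve((np.eye(n) - gamma * P).T, (1.0 - gamma) * rho)

def stationary_distribution(P):
    """Stationary distribution mu with mu^T P = mu^T and sum(mu) = 1."""
    n = P.shape[0]
    # Stack the balance equations with the normalization constraint.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.append(np.zeros(n), 1.0)
    return np.linalg.lstsq(A, b, rcond=None)[0]

def total_variation(p, q):
    """Total variation distance under the 1/2 * L1 convention."""
    return 0.5 * np.abs(p - q).sum()

# Hypothetical two-state chain induced by some fixed policy.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
rho = np.array([0.0, 1.0])  # start in state 1

mu = stationary_distribution(P)
tv_by_gamma = {
    g: total_variation(discounted_visitation(P, rho, g), mu)
    for g in (0.5, 0.9, 0.99)
}
```

For this particular chain the distance decreases as `gamma` moves toward 1, which is consistent with the qualitative claim in the summary; other chains or other definitions of the mismatch may behave differently.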