Analysis of On-policy Policy Gradient Methods under the Distribution Mismatch

Weizhen Wang; Jianping He; Xiaoming Duan

TL;DR

This paper investigates the distribution mismatch inherent in on-policy policy gradient (PG) methods for discounted reinforcement learning. It develops a theoretical framework showing that, under tabular parameterizations, the bias introduced by the mismatch does not prevent global optimality, and it extends the analysis to general parameterizations by deriving mismatch bounds that shrink as the discount factor $\gamma$ approaches 1. A finite-time convergence bound for biased PG is established under mild assumptions, which offers insight into why biased updates often perform robustly in practice. Numerical experiments on continuing and episodic tasks corroborate the theory: biased and unbiased PG converge to the same optimum, and the bias decreases as $\gamma$ grows. The results help bridge the gap between theoretical policy gradient guarantees and practical implementations that rely on biased gradient estimates.

Abstract

Policy gradient methods are among the most successful approaches for solving challenging reinforcement learning problems. However, despite their empirical success, many state-of-the-art policy gradient algorithms for discounted problems deviate from the policy gradient theorem because of a distribution mismatch. In this work, we analyze the impact of this mismatch on policy gradient methods. Specifically, we first show that, under tabular parameterizations, the methods remain globally optimal despite the mismatch. We then extend this analysis to more general parameterizations by leveraging the theory of biased stochastic gradient descent. Our findings offer new insights into the robustness of policy gradient methods and into the gap between theoretical foundations and practical implementations.
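
To make the claim that the mismatch shrinks as $\gamma$ approaches 1 more concrete, the following is a minimal numerical sketch, not the paper's construction. It assumes the common reading of the mismatch: the policy gradient theorem weights states by the normalized discounted visitation distribution $(1-\gamma)\,\mu^\top (I - \gamma P)^{-1}$, while practical on-policy implementations effectively sample states from the undiscounted (here, stationary) distribution. The two-state chain `P`, the initial distribution `mu`, and all helper names are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical 2-state Markov chain induced by some fixed policy (illustrative values).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
mu = np.array([0.0, 1.0])  # hypothetical initial state distribution


def discounted_visitation(P, mu, gamma):
    """Normalized discounted state distribution (1 - gamma) * mu^T (I - gamma P)^{-1}."""
    n = P.shape[0]
    # Solve (I - gamma P)^T x = mu rather than forming the inverse explicitly.
    x = np.linalg.solve((np.eye(n) - gamma * P).T, mu)
    return (1.0 - gamma) * x


def stationary_distribution(P):
    """Stationary distribution pi with pi^T P = pi^T and sum(pi) = 1 (ergodic chain assumed)."""
    n = P.shape[0]
    # Stack the balance equations with the normalization constraint and solve by least squares.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.concatenate([np.zeros(n), [1.0]])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()


pi = stationary_distribution(P)
gammas = [0.5, 0.9, 0.99, 0.999]
gaps = [total_variation(discounted_visitation(P, mu, g), pi) for g in gammas]
for g, gap in zip(gammas, gaps):
    print(f"gamma={g}: TV gap between discounted and stationary distributions = {gap:.4f}")
```

For this particular chain the gap decreases monotonically toward zero as `gamma` increases, which is consistent with the qualitative statement above. It does not reproduce the paper's bounds, which concern the gradient bias rather than this state-distribution gap alone.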
