Policy Optimization Algorithms in a Unified Framework
Shuang Wu
TL;DR
This work unifies policy optimization across discounted, total, and average reward objectives by introducing generalized ergodicity, a space-based formulation that uses invariant measures to relate time-averaged and space-averaged quantities. Combined with perturbation analysis, the framework rederives and connects the major algorithms (policy iteration, policy gradient, natural policy gradient, TRPO, PPO) and gives concrete guidance for avoiding common implementation errors. An LQR case study with numerical experiments shows how design choices such as discounting and ergodicity affect learning dynamics and convergence. The result is a principled, accessible toolkit for correctly deriving and implementing policy optimization methods across MDP settings, with relevance to RL, control, and related AI systems.
Abstract
Policy optimization algorithms are crucial in many fields but challenging to grasp and implement, often because of complex calculations related to Markov decision processes and varying use of discounted and average reward settings. This paper presents a unified framework that applies generalized ergodicity theory and perturbation analysis to clarify and improve the application of these algorithms. Generalized ergodicity theory sheds light on the steady-state behavior of stochastic processes, which aids understanding of both discounted and average rewards. Perturbation analysis exposes the fundamental principles behind policy optimization algorithms. We use this framework to identify common implementation errors and demonstrate the correct approaches. Through a case study on Linear Quadratic Regulator problems, we illustrate how slight variations in algorithm design affect implementation outcomes. We aim to make policy optimization algorithms more accessible and reduce their misuse in practice.
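To make the link between time-averaged and space-averaged quantities concrete, the following is a minimal sketch using only standard finite Markov chain facts; it is not the paper's generalized ergodicity construction. The 3-state transition matrix `P`, reward vector `r`, initial distribution `mu0`, and all function names are hypothetical and chosen for illustration. The sketch computes the stationary distribution (the space average), the time-averaged state distribution, and the normalized discounted occupancy measure, and checks that the discounted return equals the occupancy-weighted reward divided by 1 - gamma.

```python
import numpy as np

# Hypothetical 3-state Markov chain induced by some fixed policy
# (illustrative numbers, not taken from the paper).
P = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])
r = np.array([1.0, 0.0, 2.0])    # per-state reward under the policy
mu0 = np.array([1.0, 0.0, 0.0])  # initial state distribution


def stationary_distribution(P):
    """Solve d^T P = d^T with sum(d) = 1 by least squares."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.concatenate([np.zeros(n), [1.0]])
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d


def time_averaged_distribution(P, mu0, T):
    """Average of the state distributions mu0^T P^t for t = 0, ..., T-1."""
    mu = mu0.copy()
    acc = np.zeros_like(mu0)
    for _ in range(T):
        acc += mu
        mu = mu @ P
    return acc / T


def discounted_occupancy(P, mu0, gamma):
    """Normalized discounted visitation: (1 - gamma) * mu0^T (I - gamma P)^{-1}."""
    n = P.shape[0]
    return (1.0 - gamma) * np.linalg.solve((np.eye(n) - gamma * P).T, mu0)


gamma = 0.95
d_inf = stationary_distribution(P)
d_gamma = discounted_occupancy(P, mu0, gamma)

# Discounted return from the Bellman equation V = r + gamma * P V.
V = np.linalg.solve(np.eye(3) - gamma * P, r)
J_gamma = mu0 @ V

# Average reward as a space average under the stationary distribution.
J_avg = d_inf @ r

print("stationary distribution:", d_inf)
print("discounted occupancy   :", d_gamma)
print("average reward         :", J_avg)
print("(1 - gamma) * J_gamma  :", (1.0 - gamma) * J_gamma)
```

For this particular chain, detailed balance gives the stationary distribution (0.4, 0.4, 0.2) and an average reward of 0.8. The discounted occupancy depends on `mu0` for any gamma below 1, but it approaches the stationary distribution as gamma tends to 1, which is one standard way to see discounted and average reward objectives as weighted averages of the same reward under different measures.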
