Landscape of Policy Optimization for Finite Horizon MDPs with General State and Action

Xin Chen; Yifan Hu; Minda Zhao

TL;DR

This work develops a unified Kurdyka-Łojasiewicz (KŁ) framework to establish global convergence for policy gradient methods solving finite-horizon MDPs with general state and action spaces. By verifying bounded gradients, the KŁ property of expected Q-values, and a sequential decomposition inequality, the paper shows that both exact and stochastic policy gradient methods converge to globally optimal policies with rates that scale polynomially in the planning horizon $T$; stochastic methods achieve an $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity. The framework applies to entropy-regularized tabular MDPs, linear quadratic regulators (LQR) with affine policies, multi-period inventory systems with Markov-modulated demands, and stochastic cash balance problems, providing the first known sample complexities for some of these settings. Across these applications, the KŁ constant depends polynomially on $T$, yielding non-asymptotic convergence guarantees despite nonconvex policy landscapes. The results offer a principled path to data-driven policy optimization in operations research contexts, with implications for scalable, provably optimal decision-making in dynamic systems.

Abstract

Policy gradient methods are widely used in reinforcement learning. Yet the nonconvexity of policy optimization poses significant challenges for understanding their global convergence. For a class of finite-horizon Markov Decision Processes (MDPs) with general state and action spaces, we develop a framework that provides a set of easily verifiable assumptions to ensure the Kurdyka-Łojasiewicz (KŁ) condition of the policy optimization. Leveraging the KŁ condition, policy gradient methods converge to the globally optimal policy with a non-asymptotic rate despite nonconvexity. Our results find applications in various control and operations models, including entropy-regularized tabular MDPs, Linear Quadratic Regulator (LQR) problems, stochastic inventory models, and stochastic cash balance problems, for which we show that stochastic policy gradient methods obtain an $\epsilon$-optimal policy using a sample size that is $\tilde{\mathcal{O}}(\epsilon^{-1})$ and polynomial in the planning horizon. Our result establishes the first sample complexity for multi-period inventory systems with Markov-modulated demands and stochastic cash balance problems in the literature.
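To illustrate why a KŁ-type condition can give global convergence without convexity, the following minimal sketch runs plain gradient descent on a one-dimensional toy function, $f(x) = x^2 + 3\sin^2(x)$, which is commonly used as an example of a nonconvex function satisfying a Polyak-Łojasiewicz-type inequality. This is not the paper's MDP setting or algorithm; the function, step size, starting points, and helper names (`f`, `grad_f`, `gradient_descent`) are assumptions chosen only for illustration.

```python
import math


def f(x: float) -> float:
    """Toy objective (illustrative assumption, not from the paper)."""
    return x * x + 3.0 * math.sin(x) ** 2


def grad_f(x: float) -> float:
    """Derivative of f: 2x + 6 sin(x) cos(x) = 2x + 3 sin(2x)."""
    return 2.0 * x + 3.0 * math.sin(2.0 * x)


def second_derivative(x: float) -> float:
    """f''(x) = 2 + 6 cos(2x); it lies in [-4, 8], so f is nonconvex
    and its gradient is Lipschitz with constant 8."""
    return 2.0 + 6.0 * math.cos(2.0 * x)


def gradient_descent(x0: float, step: float = 1.0 / 8.0, iters: int = 200) -> float:
    """Plain gradient descent with constant step 1/L, where L = 8."""
    x = x0
    for _ in range(iters):
        x -= step * grad_f(x)
    return x


if __name__ == "__main__":
    for start in (3.0, -5.0):
        x_final = gradient_descent(start)
        print(f"start={start:+.1f}  x_final={x_final:.3e}  f={f(x_final):.3e}")
```

The negative curvature at $x = \pi/2$ shows the function is not convex, yet its only stationary point is the global minimizer $x = 0$, so gradient descent reaches it from either starting point. The paper establishes an analogous landscape property for policy optimization objectives under its stated assumptions.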

Paper Structure (47 sections, 27 theorems, 206 equations)

Introduction
Related Literature
Organizations
Notations and Definitions
Problem Formulation
Bellman Equation
Policy Gradient Formulation
Landscape Characterization
Definition and Properties of KŁ Condition
Convergence Rate under KŁ Condition
KŁ Condition in Policy Gradient Formulation
Entropy-Regularized Tabular MDPs
Problem Formulation
KŁ condition of Policy Gradient Objectives
Linear Quadratic Regulator
...and 32 more sections

Key Result

Proposition 1

Consider a convex and compact set ${\mathcal{X}}\subseteq{\mathbb R}^n$. Suppose a function $f:{\mathcal{X}}\to{\mathbb R}$ satisfies the KŁ condition with a KŁ constant $\mu>0$ over ${\mathcal{X}}$. Then, any point satisfying the first-order necessary optimality condition of the optimization problem ...

Theorems & Definitions (36)

Definition 1: Fréchet Subdifferential [rockafellar2009variational]
Definition 2: Limiting Subdifferential [mordukhovich1976maximum]
Definition 3: KŁ Condition
Remark 1
Proposition 1 [karimi2016linear]
Lemma 1
Remark 2
Theorem 1
Remark 3
Lemma 2
...and 26 more
