Policy Optimization over General State and Action Spaces

Caleb Ju; Guanghui Lan

TL;DR

This work addresses reinforcement learning with general state and action spaces by extending policy mirror descent (PMD) and introducing policy dual averaging (PDA) as scalable, provably convergent algorithms that accommodate function approximation without mandatory policy parameterization. It establishes linear convergence to global optima or sublinear convergence to stationary points under exact policy evaluation, and derives bounds on how policy-evaluation and approximation errors affect convergence in both finite and continuous action spaces. The authors show that PDA, in particular, can be more amenable to function approximation and offers robust performance in diverse RL tasks, including grid-world, Lunar Lander, inverted pendulum, and LQR settings. Empirical results indicate that the proposed methods are competitive with, and in some cases superior to, state-of-the-art RL algorithms, while maintaining theoretical guarantees. Overall, the paper broadens the applicability of policy-gradient-style methods to general-state RL problems with rigorous convergence analysis and practical approximation strategies.

Abstract

Reinforcement learning (RL) problems over general state and action spaces are notoriously challenging. In contrast to the tabular setting, one cannot enumerate all the states and then iteratively update the policies for each state. This prevents the application of many well-studied RL methods, especially those with provable convergence guarantees. In this paper, we first present a substantial generalization of the recently developed policy mirror descent method to deal with general state and action spaces. We introduce new approaches to incorporate function approximation into this method, so that we do not need to use explicit policy parameterization at all. Moreover, we present a novel policy dual averaging method for which possibly simpler function approximation techniques can be applied. We establish a linear rate of convergence to global optimality or sublinear convergence to stationarity for these methods applied to different classes of RL problems under exact policy evaluation. We then define proper notions of the approximation errors for policy evaluation and investigate their impact on the convergence of these methods applied to general-state RL problems with either finite-action or continuous-action spaces. To the best of our knowledge, the development of these algorithmic frameworks and their convergence analysis appear to be new in the literature. Preliminary numerical results demonstrate the robustness of the aforementioned methods and show that they can be competitive with state-of-the-art RL algorithms.
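To make the policy mirror descent idea concrete, the following is a minimal sketch for the simplest possible case only: a tiny tabular MDP with exact policy evaluation. It is not the paper's general-state algorithm and uses no function approximation. The two-state MDP (`P`, `c`), the cost-minimization convention, the KL divergence as the Bregman distance, the constant step size `eta`, and the helper names `evaluate_q` and `pmd_step` are all assumptions made for illustration. With the KL divergence, the mirror descent step reduces to a multiplicative-weights update, pi_new(a|s) proportional to pi(a|s) * exp(-eta * Q(s, a)).

```python
import math

def evaluate_q(P, c, pi, gamma, sweeps=500):
    """Approximate Q^pi and V^pi by repeated Bellman backups (costs, not rewards)."""
    n_states, n_actions = len(c), len(c[0])
    V = [0.0] * n_states
    for _ in range(sweeps):
        Q = [[c[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n_states))
              for a in range(n_actions)] for s in range(n_states)]
        V = [sum(pi[s][a] * Q[s][a] for a in range(n_actions))
             for s in range(n_states)]
    return Q, V

def pmd_step(pi, Q, eta):
    """One KL-based mirror descent step, applied independently at every state."""
    new_pi = []
    for s, row in enumerate(pi):
        q_min = min(Q[s])  # shift for numerical stability; cancels on normalization
        weights = [p * math.exp(-eta * (q - q_min)) for p, q in zip(row, Q[s])]
        total = sum(weights)
        new_pi.append([w / total for w in weights])
    return new_pi

# Hypothetical 2-state, 2-action MDP: P[s][a][s'] transitions, c[s][a] costs.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
c = [[1.0, 0.0], [0.5, 2.0]]
gamma = 0.9

pi = [[0.5, 0.5], [0.5, 0.5]]  # start from the uniform policy
values = []                    # V^{pi_k} recorded at every iteration
for _ in range(30):
    Q, V = evaluate_q(P, c, pi, gamma)
    values.append(V)
    pi = pmd_step(pi, Q, eta=1.0)
```

In this sketch the recorded values should not increase from one iteration to the next at any state, which mirrors the monotone improvement one expects from mirror descent steps under exact evaluation. Enumerating states as done here is exactly what becomes impossible over general state spaces, which is the gap the function approximation schemes described in the abstract are meant to fill.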

Paper Structure (23 sections, 23 theorems, 153 equations, 5 figures, 4 algorithms)

Introduction
Notation and terminology
Problems of Interest
Markov Decision Processes
Performance Difference and Policy Gradient
Policy Mirror Descent
The Generic Algorithmic Scheme
Function Approximation in PMD
PMD for general state and finite action spaces
PMD for general state and continuous action spaces
Policy Dual Averaging
The Generic Algorithmic Scheme
Function Approximation in PDA
PDA for General State and Finite Action Spaces
PDA for General State and Continuous Action Spaces
...and 8 more sections

Key Result

Lemma 1

Let $\pi$ and $\pi'$ be two feasible policies. Then we have [equation not recovered from source], where [definition not recovered from source].

Figures (5)

Figure 1: Mean score and 95% confidence interval (shaded region) in GridWorld.
Figure 2: Score for each of the ten seeds in GridWorld.
Figure 3: Score on Lunar Lander. See Figure 1 for more details on the plot.
Figure 4: Score on inverted pendulum.
Figure 5: Score on LQR. DDPG is not shown since its cost diverges.

Theorems & Definitions (44)

Lemma 1
Proof 1
Lemma 2
Proof 2
Lemma 3
Proof 3
Proposition 1
Proof 4
Theorem 1
Proof 5
...and 34 more
