Recurrent Natural Policy Gradient for POMDPs

Semih Cayci; Atilla Eryilmaz

TL;DR

This work addresses the difficulty of solving POMDPs by tackling non-stationarity and perceptual aliasing through a memory-enabled Rec-NAC framework that combines IndRNN-based policies with recurrent TD critics and natural policy gradient updates. It develops a rigorous non-asymptotic analysis that links memory, smoothness, and optimization complexity, revealing regimes where long-range dependencies can cause exponential resource demands unless stabilized by max-norm projections and sufficient width. The authors introduce an infinite-width IndRNN function class via neural tangent features, establish finite-time bounds for Rec-TD and Rec-NPG, and extend compatible function approximation to non-stationary, history-dependent policies in POMDPs. Overall, the paper provides principled guidance on memory-capacity and architectural choices for provably effective and efficient policy optimization in partially observable environments, with potential practical impact for memory-enabled RL in complex sensing tasks.

Abstract

Solving partially observable Markov decision processes (POMDPs) remains a fundamental challenge in reinforcement learning (RL), primarily due to the curse of dimensionality induced by the non-stationarity of optimal policies. In this work, we study a natural actor-critic (NAC) algorithm that integrates recurrent neural network (RNN) architectures into a natural policy gradient (NPG) method and a temporal difference (TD) learning method. This framework leverages the representational capacity of RNNs to address non-stationarity in RL to solve POMDPs while retaining the statistical and computational efficiency of natural gradient methods in RL. We provide non-asymptotic theoretical guarantees for this method, including bounds on sample and iteration complexity to achieve global optimality up to function approximation. Additionally, we characterize pathological cases that stem from long-term dependencies, thereby explaining limitations of RNN-based policy optimization for POMDPs.
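To make the recurrent architecture behind this summary more concrete, the following is a minimal sketch of an IndRNN forward pass and a per-neuron max-norm projection in plain Python. It is an illustration under assumptions, not the authors' implementation: the names `indrnn_forward`, `max_norm_project`, `inp`, `rec`, `out`, `rho_inp`, and `rho_rec` are hypothetical, the $1/\sqrt{m}$ output scaling and the tanh activation are assumed, and the projection is assumed to act on each neuron separately, as a ball of radius $\rho/\sqrt{m}$ around that neuron's initial weights.

```python
import math

def indrnn_forward(xs, inp, rec, out, act=math.tanh):
    """Run an IndRNN over a sequence and return one scalar output per step.

    xs  : list of T input vectors, each of length d
    inp : m x d input weights (list of rows), hypothetical name
    rec : length-m diagonal recurrent weights (one scalar per neuron)
    out : length-m output weights
    """
    m = len(rec)
    h = [0.0] * m  # hidden state starts at zero (assumption)
    outputs = []
    for x in xs:
        # Each neuron sees the input and only its OWN previous state.
        h = [
            act(sum(w * xj for w, xj in zip(inp[i], x)) + rec[i] * h[i])
            for i in range(m)
        ]
        outputs.append(sum(c * hi for c, hi in zip(out, h)) / math.sqrt(m))
    return outputs

def max_norm_project(inp, rec, inp0, rec0, rho_inp, rho_rec):
    """Project every neuron's weights onto a ball around its initial value.

    Row i of `inp` is kept within l2-distance rho_inp / sqrt(m) of inp0[i];
    rec[i] is kept within distance rho_rec / sqrt(m) of rec0[i].
    """
    m = len(rec)
    r_inp, r_rec = rho_inp / math.sqrt(m), rho_rec / math.sqrt(m)
    inp_proj = []
    for row, row0 in zip(inp, inp0):
        diff = [a - b for a, b in zip(row, row0)]
        norm = math.sqrt(sum(v * v for v in diff))
        scale = 1.0 if norm <= r_inp else r_inp / norm
        inp_proj.append([b + scale * v for b, v in zip(row0, diff)])
    # Scalar case: projection is a clip of the deviation from initialization.
    rec_proj = [b + max(-r_rec, min(r_rec, a - b)) for a, b in zip(rec, rec0)]
    return inp_proj, rec_proj
```

Because the recurrence is diagonal, the size of each entry of `rec` directly controls how strongly that neuron carries information across time steps, which is why constraining it neuron by neuron is a natural way to keep long-range dependencies in check.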

Paper Structure (26 sections, 15 theorems, 191 equations, 4 figures, 2 algorithms)

Introduction
Q$_1$. How can we achieve (i) provably effective and (ii) computation/memory-efficient policy evaluation for non-stationary policies in partially observable environments?
Q$_2$. How can we parameterize non-stationary policies by a rich and practically feasible class of RNNs and perform efficient policy optimization?
Q$_3$. What are the memory, computation and sample complexities of the resulting Rec-NAC method, which employs Rec-NPG for policy updates and Rec-TD for policy evaluation?
Previous work
Notation
Preliminaries on Partially Observable Markov Decision Processes
Independently Recurrent Neural Network Architecture
Reference Function Class for Independently Recurrent Neural Networks
Max-Norm Projection for IndRNNs
Rec-NAC: A High-Level Algorithmic View
Critic: Recurrent Temporal Difference Learning (Rec-TD)
Recurrent TD Learning Algorithm
Theoretical Analysis of Rec-TD: Finite-Time Bounds and Global Near-Optimality
Actor: Recurrent Natural Policy Gradient (Rec-NPG) for POMDPs
...and 11 more sections

Key Result

Theorem 5.4

Under the sampling-oracle and representation-td assumptions, for any projection radius $\rho \succeq \nu=(\nu_\mathsf{w},\nu_\mathsf{u})$ and step-size $\eta > 0$, Rec-TD with max-norm projection achieves a finite-time error bound for any $K\in\mathbb{N}$, with instance-dependent constants that do not depend on $K$, and $\omega_{t,k}:=\sqrt{\mathbb{E}[(F_t(\bar{Z}_t;\Theta(k)
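As a rough illustration of the kind of projected temporal-difference update that Theorem 5.4 concerns, the sketch below runs semi-gradient TD(0) with an $\ell_2$-ball projection around the initial parameters. It is a deliberately simplified stand-in, not Rec-TD itself: it assumes linear features in place of an IndRNN, a fully observed two-state toy chain in place of a POMDP, and hypothetical names (`projected_td0`, `project_ball`, `features`, `radius`).

```python
import math

def project_ball(theta, center, radius):
    """Project theta onto the l2 ball of the given radius around center."""
    diff = [t - c for t, c in zip(theta, center)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm <= radius:
        return list(theta)
    return [c + radius * d / norm for c, d in zip(center, diff)]

def projected_td0(transitions, features, theta0, eta, gamma, radius, sweeps):
    """Semi-gradient TD(0) with linear features and projection after each step.

    transitions : list of (state, reward, next_state); next_state None = terminal
    features    : function mapping a state to a feature vector
    """
    theta = list(theta0)
    for _ in range(sweeps):
        for s, r, s_next in transitions:
            phi = features(s)
            v = sum(t * p for t, p in zip(theta, phi))
            if s_next is None:
                v_next = 0.0
            else:
                v_next = sum(t * p for t, p in zip(theta, features(s_next)))
            delta = r + gamma * v_next - v  # TD error
            theta = [t + eta * delta * p for t, p in zip(theta, phi)]
            theta = project_ball(theta, theta0, radius)
    return theta

def one_hot(s):
    """Tabular features for the two-state toy chain."""
    return [1.0 if i == s else 0.0 for i in range(2)]

# Toy chain: state 0 -> state 1 (reward 0), state 1 -> terminal (reward 1).
chain = [(0, 0.0, 1), (1, 1.0, None)]
theta = projected_td0(chain, one_hot, [0.0, 0.0],
                      eta=0.5, gamma=0.9, radius=10.0, sweeps=200)
```

With a generous radius the projection never activates and the estimates approach the true values of the toy chain; with a small radius the iterates are held near the initialization, which mirrors the role of the condition $\rho \succeq \nu$ in the theorem only loosely.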

Figures (4)

Figure 1: An independently recurrent neural network (IndRNN) in the RL context.
Figure 2: Mean-squared TD and (mean) parameter deviation under Rec-TD for the case $p_\mathsf{exp}=0.8$ and $\gamma = 0.9$. The mean curve and confidence intervals (90%) stem from 5 trials.
Figure 3: Mean-squared TD and (mean) parameter deviation under Rec-TD for the case $p_\mathsf{exp}=0.25$ and $\gamma = 0.9$. The mean curve and confidence intervals (90%) stem from 5 trials.
Figure 4: MSTD performance for $m=256$ across various sequence lengths $T$, with $p_\mathsf{exp}=0.25$. Larger $T$ yields larger MSTD.

Theorems & Definitions (44)

Definition 2.1: Admissible policy
Definition 2.2: Value function, $\mathcal{Q}$-function, advantage function
Remark 2.3: Curse of history in RL for POMDPs
Definition 3.1: Symmetric random initialization
Definition 3.2: Transportation mapping
Definition 3.3: Reference function class for IndRNNs
Remark 3.4: Reduction to $\mathscr{F}_\mathrm{NTK}$
Remark 3.5: Fully-connected RNNs
Remark 5.2: Intuition behind Rec-TD
Theorem 5.4: Finite-time bounds for Rec-TD
...and 34 more
