Recurrent Natural Policy Gradient for POMDPs

Semih Cayci, Atilla Eryilmaz

TL;DR

This work addresses the difficulty of solving POMDPs by tackling non-stationarity and perceptual aliasing through a memory-enabled Rec-NAC framework that combines IndRNN-based policies with recurrent TD critics and natural policy gradient updates. It develops a rigorous non-asymptotic analysis that links memory, smoothness, and optimization complexity, revealing regimes where long-range dependencies can cause exponential resource demands unless stabilized by max-norm projections and sufficient width. The authors introduce an infinite-width IndRNN function class via neural tangent features, establish finite-time bounds for Rec-TD and Rec-NPG, and extend compatible function approximation to non-stationary, history-dependent policies in POMDPs. Overall, the paper provides principled guidance on memory-capacity and architectural choices for provably effective and efficient policy optimization in partially observable environments, with potential practical impact for memory-enabled RL in complex sensing tasks.

Abstract

Solving partially observable Markov decision processes (POMDPs) remains a fundamental challenge in reinforcement learning (RL), primarily due to the curse of dimensionality induced by the non-stationarity of optimal policies. In this work, we study a natural actor-critic (NAC) algorithm that integrates recurrent neural network (RNN) architectures into a natural policy gradient (NPG) method and a temporal difference (TD) learning method. This framework leverages the representational capacity of RNNs to address non-stationarity in RL to solve POMDPs while retaining the statistical and computational efficiency of natural gradient methods in RL. We provide non-asymptotic theoretical guarantees for this method, including bounds on sample and iteration complexity to achieve global optimality up to function approximation. Additionally, we characterize pathological cases that stem from long-term dependencies, thereby explaining limitations of RNN-based policy optimization for POMDPs.

Paper Structure

This paper contains 26 sections, 15 theorems, 191 equations, 4 figures, 2 algorithms.

Key Result

Theorem 5.4

Under the assumptions labeled sampling-oracle through representation-td, for any projection radius $\rho \succeq \nu=(\nu_\mathsf{w},\nu_\mathsf{u})$ and step-size $\eta > 0$, Rec-TD with max-norm projection achieves a finite-time error bound that holds for any $K\in\mathbb{N}$. The constants in the bound are instance-dependent and do not depend on $K$, and $\omega_{t,k}:=\sqrt{\mathbb{E}[(F_t(\bar{Z}_t;\Theta(k)$…
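
To make the role of the projection radius $\rho=(\rho_\mathsf{w},\rho_\mathsf{u})$ concrete, here is a minimal sketch of a per-unit (max-norm) projection around the initialization. It is an illustration under stated assumptions, not the paper's exact operator: the function name `project_max_norm`, the list-based parameter layout, and the $1/\sqrt{m}$ scaling of the radii are all assumptions introduced here.

```python
import math

def project_max_norm(W, u, W0, u0, rho_w, rho_u):
    """Project each hidden unit's parameters back toward its initial value.

    W, W0: m x d input weights (current and initial), as lists of rows.
    u, u0: length-m diagonal recurrent weights (current and initial).
    rho_w, rho_u: projection radii; dividing them by sqrt(m) is an ASSUMED
    scaling chosen for this sketch.
    """
    m = len(W)
    r_w = rho_w / math.sqrt(m)
    r_u = rho_u / math.sqrt(m)
    W_proj, u_proj = [], []
    for w_i, w0_i, u_i, u0_i in zip(W, W0, u, u0):
        # Row-wise l2 projection: shrink the deviation from initialization
        # so that its norm is at most r_w.
        diff = [a - b for a, b in zip(w_i, w0_i)]
        norm = math.sqrt(sum(x * x for x in diff))
        scale = 1.0 if norm <= r_w else r_w / norm
        W_proj.append([b + scale * x for b, x in zip(w0_i, diff)])
        # Scalar recurrent weight: clip the deviation to [-r_u, r_u].
        u_proj.append(u0_i + max(-r_u, min(r_u, u_i - u0_i)))
    return W_proj, u_proj
```

Because the constraint is applied unit by unit, no single hidden unit can drift far from its initial value, which is one way to keep the diagonal recurrent weights in a range where long sequences do not blow up.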

Figures (4)

  • Figure 1: An independently recurrent neural network (IndRNN) in the RL context.
  • Figure 2: Mean-squared TD and (mean) parameter deviation under Rec-TD for the case $p_\mathsf{exp}=0.8$ and $\gamma = 0.9$. The mean curve and confidence intervals (90%) stem from 5 trials.
  • Figure 3: Mean-squared TD and (mean) parameter deviation under Rec-TD for the case $p_\mathsf{exp}=0.25$ and $\gamma = 0.9$. The mean curve and confidence intervals (90%) stem from 5 trials.
  • Figure 4: MSTD performance for $m=256$ and $p_\mathsf{exp}=0.25$ across various sequence lengths $T$. MSTD grows as $T$ increases.
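
As a companion to Figure 1, the following is a minimal sketch of a single-layer IndRNN forward pass, in which each hidden unit depends only on the current input and its own previous state. The helper names `indrnn_forward` and `symmetric_init`, the `tanh` activation, the $1/\sqrt{m}$ readout scaling, and the mirrored-halves initialization are assumptions made for illustration; the paper's Definition 3.1 may differ in its details.

```python
import math
import random

def indrnn_forward(xs, W, u, c, act=math.tanh):
    """Run a single-layer IndRNN over a sequence; return one scalar per step.

    xs: list of input vectors (each of length d)
    W:  m x d input weights (list of rows)
    u:  length-m diagonal recurrent weights (one scalar per hidden unit)
    c:  length-m output weights
    """
    m = len(W)
    h = [0.0] * m
    outputs = []
    for x in xs:
        # Unit i sees the input through row W[i] and only its own previous
        # state h[i]; there is no cross-unit recurrent coupling.
        h = [act(sum(w * xj for w, xj in zip(W[i], x)) + u[i] * h[i])
             for i in range(m)]
        # Linear readout with 1/sqrt(m) scaling (an assumed choice).
        outputs.append(sum(ci * hi for ci, hi in zip(c, h)) / math.sqrt(m))
    return outputs

def symmetric_init(m, d, seed=0):
    """ASSUMED symmetric initialization: the second half of the units mirrors
    the first half with negated output weights, so the output starts near 0."""
    assert m % 2 == 0, "m must be even to mirror the two halves"
    rng = random.Random(seed)
    half = m // 2
    W_half = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(half)]
    u_half = [rng.uniform(-1.0, 1.0) for _ in range(half)]
    c_half = [rng.choice([-1.0, 1.0]) for _ in range(half)]
    W = W_half + [row[:] for row in W_half]
    u = u_half + u_half[:]
    c = c_half + [-ci for ci in c_half]
    return W, u, c
```

The diagonal recurrence is what distinguishes this from a fully-connected RNN: the recurrent parameters form a vector `u` rather than an m x m matrix, so the effect of memory on each unit is governed by a single scalar.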

Theorems & Definitions (44)

  • Definition 2.1: Admissible policy
  • Definition 2.2: Value function, $\mathcal{Q}$-function, advantage function
  • Remark 2.3: Curse of history in RL for POMDPs
  • Definition 3.1: Symmetric random initialization
  • Definition 3.2: Transportation mapping
  • Definition 3.3: Reference function class for IndRNNs
  • Remark 3.4: Reduction to $\mathscr{F}_\mathrm{NTK}$
  • Remark 3.5: Fully-connected RNNs
  • Remark 5.2: Intuition behind Rec-TD
  • Theorem 5.4: Finite-time bounds for Rec-TD
  • ...and 34 more