Recurrent Natural Policy Gradient for POMDPs
Semih Cayci, Atilla Eryilmaz
TL;DR
This work addresses the difficulty of solving POMDPs, where optimal policies are non-stationary and observations suffer from perceptual aliasing, with a memory-enabled recurrent natural actor-critic (Rec-NAC) framework that combines IndRNN-based policies, recurrent TD critics (Rec-TD), and recurrent natural policy gradient updates (Rec-NPG). It develops a non-asymptotic analysis that links memory, smoothness, and optimization complexity, and identifies regimes where long-range dependencies can cause exponential resource demands unless training is stabilized by max-norm projections and sufficient network width. The authors introduce an infinite-width IndRNN function class via neural tangent features, establish finite-time bounds for Rec-TD and Rec-NPG, and extend compatible function approximation to non-stationary, history-dependent policies in POMDPs. Overall, the paper offers principled guidance on memory capacity and architectural choices for provably efficient policy optimization in partially observable environments.
Abstract
Solving partially observable Markov decision processes (POMDPs) remains a fundamental challenge in reinforcement learning (RL), primarily due to the curse of dimensionality induced by the non-stationarity of optimal policies. In this work, we study a natural actor-critic (NAC) algorithm that integrates recurrent neural network (RNN) architectures into a natural policy gradient (NPG) method and a temporal difference (TD) learning method. This framework leverages the representational capacity of RNNs to handle the non-stationarity that arises in POMDPs, while retaining the statistical and computational efficiency of natural gradient methods in RL. We provide non-asymptotic theoretical guarantees for this method, including bounds on the sample and iteration complexity required to achieve global optimality up to function approximation error. Additionally, we characterize pathological cases that stem from long-term dependencies, thereby explaining limitations of RNN-based policy optimization for POMDPs.
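To make the idea of a history-dependent recurrent policy concrete, the following is a minimal sketch and not the authors' implementation. It assumes an IndRNN-style cell (elementwise recurrent weights `u`), a `tanh` activation, a linear-softmax action head `C`, and a per-row projection onto an L2 ball around the initial weights as one plausible reading of a max-norm projection. All names, dimensions, and numeric values are hypothetical.

```python
import math

def indrnn_step(h, x, W, u, b):
    """One IndRNN-style step: h_new[i] = tanh(W[i] . x + u[i] * h[i] + b[i]).
    The recurrence is elementwise: unit i only sees its own previous state."""
    return [
        math.tanh(sum(w * xj for w, xj in zip(W[i], x)) + u[i] * h[i] + b[i])
        for i in range(len(h))
    ]

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def policy(history, W, u, b, C):
    """Action probabilities given the full observation history (assumed head C)."""
    h = [0.0] * len(u)
    for x in history:
        h = indrnn_step(h, x, W, u, b)
    logits = [sum(c * hi for c, hi in zip(row, h)) for row in C]
    return softmax(logits)

def project_max_norm(W, W0, radius):
    """Assumed projection: clip each row of W to an L2 ball of the given
    radius centered at the matching row of the initial weights W0."""
    out = []
    for row, row0 in zip(W, W0):
        diff = [a - a0 for a, a0 in zip(row, row0)]
        norm = math.sqrt(sum(d * d for d in diff))
        scale = min(1.0, radius / norm) if norm > 0 else 1.0
        out.append([a0 + scale * d for a0, d in zip(row0, diff)])
    return out

# Hypothetical toy setup: 2 hidden units, 2-dimensional observations, 2 actions.
W = [[1.0, 0.0], [0.0, 1.0]]
u = [0.5, 0.5]
b = [0.0, 0.0]
C = [[1.0, 0.0], [0.0, 1.0]]

# Two histories that end in the same observation but differ earlier.
probs_a = policy([[1.0, 0.0], [0.0, 1.0]], W, u, b, C)
probs_b = policy([[0.0, 1.0], [0.0, 1.0]], W, u, b, C)
```

In this toy setup the two histories share their final observation, so a memoryless policy could not tell them apart, while the recurrent state yields different action distributions. The projection step is the kind of stabilization a training loop might apply after each parameter update; where exactly it is applied here is an assumption.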
