Learning POMDPs with Linear Function Approximation and Finite Memory
Ali Devran Kara
TL;DR
This work extends linear function approximation and finite-memory techniques to learning in partially observed MDPs (POMDPs). By formulating a finite-memory belief-MDP reduction and analyzing both policy evaluation and Q-value learning, the paper establishes convergence results under ergodicity and projection-based conditions, with error bounds tied to filter stability and approximation quality. It also identifies special cases, such as exactly linear cost and dynamics or discretization-based basis functions, where convergence can be guaranteed under milder assumptions. The approach provides a principled framework for quantifying the trade-offs between memory, discretization, and approximation in POMDP learning, with practical implications for scalable RL under partial observability.
Abstract
We study reinforcement learning with linear function approximation and finite-memory approximations for partially observed Markov decision processes (POMDPs). We first present an algorithm for the value evaluation of finite-memory feedback policies. We provide error bounds derived from filter stability and projection errors. We then study the learning of finite-memory-based near-optimal Q values. Convergence in this case requires further assumptions on the exploration policy when using general basis functions. We then show that these assumptions can be relaxed for specific models such as those with perfectly linear cost and dynamics, or when using discretization-based basis functions.
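To make the policy-evaluation setting concrete, the following is a minimal sketch of standard TD(0) with linear features computed from a finite window of past observations and actions. It is not the paper's algorithm: the toy two-state POMDP, the one-hot feature map `make_features`, the step-size schedule, and the function name `td0_finite_memory` are all assumptions chosen for illustration.

```python
import numpy as np


def make_features(obs_window, act_window, n_obs, n_act):
    """Bias term plus a one-hot encoding of every observation and action in the window."""
    phi = [1.0]
    for y in obs_window:
        phi.extend(np.eye(n_obs)[y])
    for u in act_window:
        phi.extend(np.eye(n_act)[u])
    return np.array(phi)


def td0_finite_memory(n_steps=5000, window=2, beta=0.9, seed=0):
    """TD(0) evaluation of a fixed finite-memory policy on a hypothetical toy POMDP."""
    rng = np.random.default_rng(seed)
    n_states, n_obs, n_act = 2, 2, 2

    # Hypothetical model (assumed for illustration, not taken from the paper).
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[u, x] = next-state distribution
                  [[0.5, 0.5], [0.5, 0.5]]])
    O = np.array([[0.8, 0.2], [0.3, 0.7]])     # O[x] = observation distribution
    cost = np.array([[0.0, 1.0], [2.0, 0.5]])  # cost[x, u]

    def policy(obs_window):
        # Finite-memory feedback policy: act on the most recent observation.
        return int(obs_window[-1])

    # Initialize the hidden state and fill the window with placeholder history.
    x = 0
    y = int(rng.choice(n_obs, p=O[x]))
    obs = [y] * window
    acts = [0] * window

    theta = np.zeros(1 + window * (n_obs + n_act))
    for t in range(n_steps):
        phi = make_features(obs, acts, n_obs, n_act)
        u = policy(obs)
        c = cost[x, u]

        # The hidden state evolves; the learner only sees the new observation.
        x = int(rng.choice(n_states, p=P[u, x]))
        y = int(rng.choice(n_obs, p=O[x]))
        obs = obs[1:] + [y]
        acts = acts[1:] + [u]
        phi_next = make_features(obs, acts, n_obs, n_act)

        # Standard TD(0) update with a decaying step size.
        alpha = 0.1 / (1.0 + t / 1000.0)
        td_error = c + beta * theta @ phi_next - theta @ phi
        theta = theta + alpha * td_error * phi
    return theta


if __name__ == "__main__":
    theta = td0_finite_memory()
    print("learned weights:", theta)
```

In this sketch, increasing `window` gives the features more history at the price of a larger weight vector. That is the memory-versus-approximation trade-off the error bounds are meant to quantify.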
