Stochastic Primal-Dual Methods and Sample Complexity of Reinforcement Learning

Yichen Chen, Mengdi Wang

TL;DR

A class of Stochastic Primal-Dual methods is proposed that exploits the inherent minimax duality of Bellman equations, uses little storage, and has low computational complexity per iteration.

Abstract

We study the online estimation of the optimal policy of a Markov decision process (MDP). We propose a class of Stochastic Primal-Dual (SPD) methods which exploit the inherent minimax duality of Bellman equations. The SPD methods update a few coordinates of the value and policy estimates each time a new state transition is observed. These methods use little storage and have low computational complexity per iteration. The SPD methods find an absolute-$\epsilon$-optimal policy, with high probability, using $\mathcal{O}\left(\frac{|\mathcal{S}|^4 |\mathcal{A}|^2 \sigma^2}{(1-\gamma)^6 \epsilon^2}\right)$ iterations/samples for the infinite-horizon discounted-reward MDP and $\mathcal{O}\left(\frac{|\mathcal{S}|^4 |\mathcal{A}|^2 H^6 \sigma^2}{\epsilon^2}\right)$ iterations/samples for the finite-horizon MDP.
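
As a sketch of the minimax duality referred to above, the discounted-reward Bellman equation can be written as a linear program whose Lagrangian gives a saddle-point problem. The notation below is assumed for illustration and does not appear in the abstract: $v \in \mathbb{R}^{|\mathcal{S}|}$ is a value vector, $\lambda_a \ge 0$ is a dual vector for action $a$, $P_a$ and $r_a$ are the transition matrix and expected reward vector under action $a$, and $\xi$ is any positive weight vector over states.

$$
\min_{v} \; \xi^\top v \quad \text{subject to} \quad (I - \gamma P_a)\, v - r_a \ge 0 \quad \text{for all } a \in \mathcal{A}
$$

$$
\min_{v} \; \max_{\lambda \ge 0} \; L(v, \lambda) = \xi^\top v + \sum_{a \in \mathcal{A}} \lambda_a^\top \big( (\gamma P_a - I)\, v + r_a \big)
$$

In this standard formulation, the minimizing $v$ is the optimal value function and the maximizing $\lambda$ encodes an optimal policy, so a method that takes noisy gradient steps on $v$ and $\lambda$ from sampled transitions can plausibly update only the coordinates touched by each observed transition. The exact update rules and constraint sets used by the SPD methods are not given in the abstract.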

Paper Structure