Mirror descent actor-critic methods for entropy regularised MDPs in general spaces: stability and convergence
Denis Zorba, David Šiška, Lukasz Szpruch
TL;DR
This paper studies the stability and convergence of a discrete-time policy mirror descent actor-critic scheme, with the critic updated by temporal difference (TD) learning, for entropy-regularised MDPs on Polish state and action spaces. It analyses a single-loop variant (one TD step per policy update) and a double-loop variant (multiple TD steps per policy update), proving uniform KL bounds, bounds on the critic, and convergence rates. The main results are sub-linear convergence when the number of TD steps grows logarithmically with the number of policy updates, and linear convergence when it grows linearly under a concentrability assumption, together with explicit stability conditions and corollaries for finite action spaces. These results close a theoretical gap for actor-critic methods in entropy-regularised MDPs on Polish spaces and lay the groundwork for sample-based extensions and for analyses with non-linear function approximation.
Abstract
We provide theoretical guarantees for convergence of discrete-time policy mirror descent with inexact advantage functions updated using temporal difference (TD) learning for entropy regularised MDPs in Polish state and action spaces. We rigorously derive sufficient conditions under which the single-loop actor-critic scheme is stable and convergent. To weaken these conditions, we introduce a variant that performs multiple TD steps per policy update and derive an explicit lower bound on the number of TD steps required to ensure stability. Finally, we establish sub-linear convergence when the number of TD steps grows logarithmically with the number of policy updates, and linear convergence when it grows linearly under a concentrability assumption.
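To make the single-loop and multi-step schemes concrete, the following is a minimal tabular sketch in Python. It is not the paper's algorithm. It assumes finite state and action spaces, a known transition kernel, reward maximisation, an expected (population) TD step in place of sampled TD with function approximation, and one common form of the entropy-regularised mirror descent update, pi_new(a|s) proportional to pi(a|s)^(1 - eta*tau) * exp(eta * Q(s,a)). All names (`td_step`, `mirror_step`, `actor_critic`, `td_steps`) and all default parameter values are illustrative assumptions.

```python
import math

def td_step(Q, pi, P, r, gamma, tau, beta):
    """One expected TD step: Q <- Q + beta * (T^pi Q - Q), with T^pi the soft Bellman operator."""
    # Soft state value: V(s) = sum_a pi(a|s) * (Q(s,a) - tau * log pi(a|s))
    V = [sum(p * (q - tau * math.log(p)) for p, q in zip(pi_s, Q_s))
         for pi_s, Q_s in zip(pi, Q)]
    new_Q = []
    for s in range(len(Q)):
        row = []
        for a in range(len(Q[s])):
            target = r[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
            row.append((1.0 - beta) * Q[s][a] + beta * target)
        new_Q.append(row)
    return new_Q

def mirror_step(Q, pi, eta, tau):
    """Mirror descent step (assumed form): pi_new ~ pi^(1 - eta*tau) * exp(eta * Q)."""
    new_pi = []
    for pi_s, Q_s in zip(pi, Q):
        logits = [(1.0 - eta * tau) * math.log(p) + eta * q for p, q in zip(pi_s, Q_s)]
        m = max(logits)  # subtract the maximum for numerical stability
        weights = [math.exp(l - m) for l in logits]
        z = sum(weights)
        new_pi.append([w / z for w in weights])
    return new_pi

def actor_critic(P, r, gamma=0.9, tau=0.1, eta=0.5, beta=0.5,
                 n_policy_updates=200, td_steps=lambda k: 1):
    """P[s][a][s2] is a transition probability, r[s][a] a reward.

    td_steps(k) is the number of critic steps before the k-th policy update:
    the default of 1 gives the single-loop scheme, larger values the multi-step variant.
    """
    n_states, n_actions = len(r), len(r[0])
    pi = [[1.0 / n_actions] * n_actions for _ in range(n_states)]  # uniform initial policy
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for k in range(n_policy_updates):
        for _ in range(td_steps(k)):
            Q = td_step(Q, pi, P, r, gamma, tau, beta)  # critic update
        pi = mirror_step(Q, pi, eta, tau)               # actor update
    return pi, Q

# Toy MDP (hypothetical): transitions are uniform, action 0 always pays reward 1.
P = [[[0.5, 0.5] for _ in range(2)] for _ in range(2)]
r = [[1.0, 0.0], [1.0, 0.0]]
pi_single, _ = actor_critic(P, r)
pi_multi, _ = actor_critic(P, r, td_steps=lambda k: 1 + int(math.log(k + 1)))
```

In this sketch, passing `td_steps=lambda k: 1 + int(math.log(k + 1))` mimics a number of TD steps that grows logarithmically with the number of policy updates. The step sizes `eta` and `beta` here are arbitrary and are not the stability conditions derived in the paper.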
