Mirror descent actor-critic methods for entropy regularised MDPs in general spaces: stability and convergence

Denis Zorba; David Šiška; Lukasz Szpruch

TL;DR

This paper studies stability and convergence of a discrete-time policy mirror descent actor-critic framework with TD-based critic updates for entropy-regularised MDPs in general (Polish) state and action spaces. It analyses a single-loop variant (one TD step per policy update) and a double-loop variant (multiple TD steps per policy update), proving uniform KL-boundedness of the policy iterates, bounds on the critic iterates, and convergence rates. The key results are sub-linear convergence when the number of TD steps grows logarithmically with the number of policy updates, and linear convergence when it grows linearly under a concentrability assumption, together with explicit stability conditions and corollaries for finite action spaces. The findings close a theoretical gap for actor-critic methods in entropy-regularised MDPs on Polish spaces and lay groundwork for future sample-based extensions and analyses with non-linear function approximation.

Abstract

We provide theoretical guarantees for convergence of discrete-time policy mirror descent with inexact advantage functions updated using temporal difference (TD) learning for entropy regularised MDPs in Polish state and action spaces. We rigorously derive sufficient conditions under which the single-loop actor-critic scheme is stable and convergent. To weaken these conditions, we introduce a variant that performs multiple TD steps per policy update and derive an explicit lower bound on the number of TD steps required to ensure stability. Finally, we establish sub-linear convergence when the number of TD steps grows logarithmically with the number of policy updates, and linear convergence when it grows linearly under a concentrability assumption.
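As an illustration of the scheme described in the abstract, the following is a minimal tabular sketch, not the paper's algorithm. It makes several assumptions: a finite state and action space, a cost-minimisation convention, a tabular critic (one-hot features, so the parameter is a table of Q-values), a population (expected) TD(0) update with uniform weighting, and the common closed-form mirror step pi_new proportional to pi^(1 - h*tau) * mu^(h*tau) * exp(-h*Q). All function and parameter names (`td_step`, `mirror_step`, `n_td_steps`, `h_actor`, `h_critic`) are hypothetical. Setting `n_td_steps=1` mimics a single-loop scheme; larger values mimic a double-loop scheme.

```python
import math


def soft_values(theta, pi, mu, tau):
    """v(s) = sum_a pi(a|s) * (Q(s,a) + tau * log(pi(a|s) / mu(a)))."""
    return [
        sum(p * (q + tau * math.log(p / m)) for p, q, m in zip(pi_s, th_s, mu))
        for pi_s, th_s in zip(pi, theta)
    ]


def td_step(theta, pi, cost, P, mu, gamma, tau, h_critic):
    """One synchronous expected TD(0) step on the tabular critic (assumed form)."""
    v = soft_values(theta, pi, mu, tau)  # computed once from the old critic
    new_theta = []
    for s, th_s in enumerate(theta):
        row = []
        for a, q in enumerate(th_s):
            target = cost[s][a] + gamma * sum(
                p * v_next for p, v_next in zip(P[s][a], v)
            )
            row.append(q - h_critic * (q - target))
        new_theta.append(row)
    return new_theta


def mirror_step(theta, pi, mu, tau, h_actor):
    """Entropy-regularised mirror descent step, normalised per state (assumed form)."""
    new_pi = []
    for pi_s, th_s in zip(pi, theta):
        logits = [
            (1.0 - h_actor * tau) * math.log(p)
            + h_actor * tau * math.log(m)
            - h_actor * q
            for p, q, m in zip(pi_s, th_s, mu)
        ]
        top = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        new_pi.append([w / total for w in weights])
    return new_pi


def actor_critic(cost, P, mu, gamma, tau, h_actor, h_critic, n_iters, n_td_steps=1):
    """Alternate n_td_steps critic updates with one policy update."""
    theta = [[0.0] * len(mu) for _ in cost]  # critic initialised at zero
    pi = [list(mu) for _ in cost]            # policy initialised at the reference measure
    for _ in range(n_iters):
        for _ in range(n_td_steps):
            theta = td_step(theta, pi, cost, P, mu, gamma, tau, h_critic)
        pi = mirror_step(theta, pi, mu, tau, h_actor)
    return theta, pi


# Toy problem (hypothetical): action 0 costs 0, action 1 costs 1, and the
# transitions do not depend on the action, so Q(s,1) - Q(s,0) = 1 exactly.
cost = [[0.0, 1.0], [0.0, 1.0]]
P = [[[0.5, 0.5], [0.5, 0.5]], [[0.9, 0.1], [0.9, 0.1]]]
mu = [0.5, 0.5]
tau = 0.5
theta, pi = actor_critic(cost, P, mu, gamma=0.9, tau=tau,
                         h_actor=0.1, h_critic=0.5, n_iters=500)
```

In this toy problem the regularised optimal policy has a closed form: with a uniform reference measure, the probability of the cheaper action is 1 / (1 + exp(-1/tau)), which is about 0.881 for tau = 0.5. This gives a simple sanity check on the iterates. The step sizes here are chosen by hand and are not derived from the paper's stability conditions.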

Paper Structure (29 sections, 22 theorems, 185 equations, 2 algorithms)

Introduction
Related works
Contributions
Entropy regularised Markov Decision Processes
Mirror Descent and Temporal Difference
Single loop actor-critic
Stability
Convergence
Double loop actor-critic
Stability
Convergence
Conclusion and future directions
Notation
Technical Details
Proofs for the single-loop actor-critic section
...and 14 more sections

Key Result

Lemma 3.1

Let Assumptions as:e_value and as:bounded_phi hold. Let $0<h \leq \frac{\Gamma}{6(1+\gamma)^2}$ and, for some $\theta^0 = \theta_0 \in \mathbb{R}^{N}$ and $\pi^0 = \pi_0 \in \Pi_{\mu}$, let $\{\theta^{n},\pi^n\}_{n\in \mathbb{N}}$ be the iterates of Algorithm algo:single_loop_ac. Then for all $n \in \dots$

Theorems & Definitions (45)

Definition 1.1: Admissible Policies
Definition 2.1
Definition 2.2
Remark 2.3
Lemma 3.1
Corollary 3.2
Lemma 3.3
Theorem 3.4
Remark 3.5
Theorem 3.6
...and 35 more
