A Finite-Sample Analysis of an Actor-Critic Algorithm for Mean-Variance Optimization in a Discounted MDP

Tejaram Sangadi; L. A. Prashanth; Krishna Jagannathan

TL;DR

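The paper establishes finite-sample guarantees for risk-sensitive actor-critic reinforcement learning with variance as the risk measure: an $O(1/t)$ rate for a TD critic with linear function approximation, and an $O(n^{-1/4})$ rate for an SPSA-based actor-critic algorithm.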

Abstract

Motivated by applications in risk-sensitive reinforcement learning, we study mean-variance optimization in a discounted reward Markov Decision Process (MDP). Specifically, we analyze a Temporal Difference (TD) learning algorithm with linear function approximation (LFA) for policy evaluation. We derive finite-sample bounds that hold (i) in the mean-squared sense and (ii) with high probability under tail iterate averaging, both with and without regularization. Our bounds exhibit an exponentially decaying dependence on the initial error and a convergence rate of $O(1/t)$ after $t$ iterations. Moreover, for the regularized TD variant, our bound holds for a universal step size. Next, we integrate a Simultaneous Perturbation Stochastic Approximation (SPSA)-based actor update with an LFA critic and establish an $O(n^{-1/4})$ convergence guarantee, where $n$ denotes the iterations of the SPSA-based actor-critic algorithm. These results establish finite-sample theoretical guarantees for risk-sensitive actor-critic methods in reinforcement learning, with a focus on variance as a risk measure.
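As a rough illustration of the critic side described above, the following sketch runs TD learning with linear features and averages only the tail of the iterates. It is a simplified stand-in, not the paper's mean-variance critic: it estimates the ordinary value function only, and the two-state reward process, the one-hot features, the step size, and the function name `td0_tail_average` are all assumptions made for this example.

```python
import random


def td0_tail_average(num_iters=20000, step_size=0.05, gamma=0.5, seed=0):
    """Plain TD(0) with linear features and tail-iterate averaging (toy sketch)."""
    rng = random.Random(seed)
    # Hypothetical 2-state Markov reward process: from either state the next
    # state is uniform, and the reward depends only on the current state.
    rewards = [1.0, 0.0]
    features = [[1.0, 0.0], [0.0, 1.0]]  # one-hot features, so the fit is exact here
    w = [0.0, 0.0]                       # initial parameter w_0
    tail_start = num_iters // 2          # average only the second half of the iterates
    tail_sum = [0.0, 0.0]
    for t in range(num_iters):
        s = rng.randrange(2)             # state drawn from the (uniform) stationary law
        s_next = rng.randrange(2)        # next state drawn from the transition kernel
        phi, phi_next = features[s], features[s_next]
        v = sum(wi * fi for wi, fi in zip(w, phi))
        v_next = sum(wi * fi for wi, fi in zip(w, phi_next))
        td_error = rewards[s] + gamma * v_next - v
        # Semi-gradient TD(0) step with a constant step size.
        w = [wi + step_size * td_error * fi for wi, fi in zip(w, phi)]
        if t >= tail_start:
            tail_sum = [a + wi for a, wi in zip(tail_sum, w)]
    return [a / (num_iters - tail_start) for a in tail_sum]


w_avg = td0_tail_average()
print(w_avg)  # should land near the true values [1.5, 0.5] of this toy chain
```

Averaging only the later iterates discards the early ones, which are dominated by the initial error, while still smoothing the noise of the constant-step-size updates.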

Paper Structure (28 sections, 18 theorems, 190 equations, 1 figure, 1 table, 2 algorithms)

Introduction
Problem formulation
Mean-variance TD-critic
Basic algorithm.
Bounds for the TD-critic.
Mean-Squared Error Bounds.
Tail averaging.
Regularization for universal step size.
High-probability bounds.
Discussion:
SPSA-based Actor
Basic algorithm.
Need for SPSA.
Actor.
Critic.
...and 13 more sections

Key Result

Theorem 3.1

Suppose Assumptions asm:stationary and asm:iidNoise hold. Run the TD updates in eq:v-td-update for $t$ iterations with a step size $\beta$ satisfying $\beta \leq \beta_{\max} = \frac{\mu}{c}$, where $\mu = \lambda_{\mathsf{min}}\left(\tfrac{\mathbf{M}^{\top}+\mathbf{M}}{2}\right)$ and $c = \max \{ 4 (\phi^{v} \cdots$ … where $w_0$ is the initial parameter, $\bar{w}$ is the TD fixed point, $z_{0} = w_{0}-\bar{w}$ is i…
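To make the step-size constraint concrete, here is a minimal sketch that evaluates $\mu = \lambda_{\mathsf{min}}\left(\tfrac{\mathbf{M}^{\top}+\mathbf{M}}{2}\right)$ and $\beta_{\max} = \mu / c$ for a small example. The matrix `M` and the constant `c` below are hypothetical placeholders chosen for illustration, since the definition of $c$ is cut off in the statement above; the helper handles only the $2 \times 2$ case, using the closed form for the eigenvalues of a symmetric matrix.

```python
import math


def min_eig_symmetric_part_2x2(m):
    """Smallest eigenvalue of (M^T + M) / 2 for a 2x2 matrix given as nested lists."""
    a = m[0][0]
    d = m[1][1]
    b = 0.5 * (m[0][1] + m[1][0])  # off-diagonal entry of the symmetric part
    # Eigenvalues of [[a, b], [b, d]] are the mean of the diagonal plus or
    # minus the radius below; the smaller one uses the minus sign.
    radius = math.sqrt((0.5 * (a - d)) ** 2 + b ** 2)
    return 0.5 * (a + d) - radius


M = [[2.0, 1.0], [0.0, 2.0]]  # hypothetical matrix, not taken from the paper
c = 10.0                      # hypothetical constant, not the paper's definition

mu = min_eig_symmetric_part_2x2(M)
beta_max = mu / c
print(mu, beta_max)
```

The constraint is only meaningful when $\mu > 0$, that is, when the symmetric part of $\mathbf{M}$ is positive definite; for larger matrices a symmetric eigenvalue routine from a numerical library would replace the closed form.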

Figures (1)

Figure 1: Logical dependency graph for proving \ref{thm:actor}. Rectangular nodes (blue) represent established results from prior work, elliptical nodes (green) denote our novel contributions, and dashed lines illustrate the logical dependencies we establish to derive the final result (green circle).

Theorems & Definitions (36)

Theorem 3.1
Theorem 3.2
Theorem 3.3
Theorem 3.4
Theorem 3.5
Lemma 4.1
Lemma 4.2
Theorem 4.3
Remark 1
Remark 2
...and 26 more
