
A Finite-Sample Analysis of an Actor-Critic Algorithm for Mean-Variance Optimization in a Discounted MDP

Tejaram Sangadi, L. A. Prashanth, Krishna Jagannathan

TL;DR

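This paper establishes finite-sample guarantees for risk-sensitive actor-critic methods in reinforcement learning with variance as the risk measure: $O(1/t)$ bounds for a TD critic with linear function approximation, and an $O(n^{-1/4})$ convergence guarantee for an SPSA-based actor.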

Abstract

Motivated by applications in risk-sensitive reinforcement learning, we study mean-variance optimization in a discounted reward Markov Decision Process (MDP). Specifically, we analyze a Temporal Difference (TD) learning algorithm with linear function approximation (LFA) for policy evaluation. We derive finite-sample bounds that hold (i) in the mean-squared sense and (ii) with high probability under tail iterate averaging, both with and without regularization. Our bounds exhibit an exponentially decaying dependence on the initial error and a convergence rate of $O(1/t)$ after $t$ iterations. Moreover, for the regularized TD variant, our bound holds for a universal step size. Next, we integrate a Simultaneous Perturbation Stochastic Approximation (SPSA)-based actor update with an LFA critic and establish an $O(n^{-1/4})$ convergence guarantee, where $n$ denotes the iterations of the SPSA-based actor-critic algorithm. These results establish finite-sample theoretical guarantees for risk-sensitive actor-critic methods in reinforcement learning, with a focus on variance as a risk measure.
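
As a rough illustration of the critic described in the abstract, the sketch below runs TD(0) with linear features for both the value function $V$ and the second moment $U$ of the discounted return, then tail-averages the iterates and forms the variance as $U - V^2$. It is a minimal sketch under stated assumptions, not the paper's algorithm: the paper's update (eq:v-td-update) is not reproduced in this excerpt, the second-moment recursion $U(s) = \mathbb{E}[r^2 + 2\gamma r V(s') + \gamma^2 U(s')]$ is the standard one for discounted returns, and the names `td_mean_variance`, `sample_step`, `phi`, and the two-state toy chain are hypothetical.

```python
import random


def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def td_mean_variance(sample_step, phi, dim, gamma, beta, num_iters, tail_start, s0=0):
    """TD(0) with linear features for the value V and the second moment U
    of the discounted return, followed by tail averaging of both iterates.

    sample_step(s) -> (reward, next_state) and phi(s) -> feature list are
    hypothetical callables supplied by the caller (assumptions of this sketch).
    """
    v = [0.0] * dim      # weights for V(s) ~ v . phi(s)
    u = [0.0] * dim      # weights for U(s) ~ u . phi(s)
    v_sum = [0.0] * dim  # running sums over the tail iterates
    u_sum = [0.0] * dim
    s = s0
    for t in range(num_iters):
        r, s_next = sample_step(s)
        x, x_next = phi(s), phi(s_next)
        v_next = dot(v, x_next)
        # TD error for the value function.
        delta_v = r + gamma * v_next - dot(v, x)
        # TD error for the second moment, using the current value estimate.
        delta_u = (r * r + 2.0 * gamma * r * v_next
                   + gamma ** 2 * dot(u, x_next) - dot(u, x))
        for i in range(dim):
            v[i] += beta * delta_v * x[i]
            u[i] += beta * delta_u * x[i]
        if t >= tail_start:  # average only the last num_iters - tail_start iterates
            for i in range(dim):
                v_sum[i] += v[i]
                u_sum[i] += u[i]
        s = s_next
    n_tail = num_iters - tail_start
    return [a / n_tail for a in v_sum], [a / n_tail for a in u_sum]


# Toy chain (an assumption for illustration): the next state is uniform on {0, 1},
# the reward is 0 in state 0 and 2 in state 1, and the features are one-hot.
rng = random.Random(0)


def sample_step(s):
    reward = 2.0 if s == 1 else 0.0
    return reward, rng.randrange(2)


def phi(s):
    return [1.0, 0.0] if s == 0 else [0.0, 1.0]


v_bar, u_bar = td_mean_variance(sample_step, phi, dim=2, gamma=0.5, beta=0.05,
                                num_iters=20000, tail_start=10000)
variance = [dot(u_bar, phi(s)) - dot(v_bar, phi(s)) ** 2 for s in (0, 1)]
print("V estimate:", v_bar)
print("U estimate:", u_bar)
print("variance estimate:", variance)
```

For this toy chain the closed-form targets are $V = (1, 3)$ and $U = (4/3, 28/3)$, so the return variance is $1/3$ in both states; the tail-averaged estimates should land near these values. A constant step size combined with tail averaging mirrors the setting of the abstract's bounds, although the step-size constraint from the paper is not checked here.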

Paper Structure

This paper contains 28 sections, 18 theorems, 190 equations, 1 figure, 1 table, 2 algorithms.

Key Result

Theorem 3.1

Suppose assumptions asm:stationary and asm:iidNoise hold. Run the TD updates in eq:v-td-update for $t$ iterations with a step size $\beta$ satisfying $\beta \leq \beta_{\max} = \frac{\mu}{c}$, where $\mu = \lambda_{\mathsf{min}}\left(\tfrac{\mathbf{M}^{\top}+\mathbf{M}}{2}\right)$ and $c = \max\{4(\phi^{v}\ldots$ […] where $w_0$ is the initial parameter, $\bar{w}$ is the TD fixed point, $z_{0} = w_{0}-\bar{w}$ is i…

Figures (1)

  • Figure 1: Logical dependency graph for proving the theorem labelled thm:actor. Rectangular nodes (blue) represent established results from prior work, elliptical nodes (green) denote our novel contributions, and dashed lines illustrate the logical dependencies we establish to derive the final result (green circle).

Theorems & Definitions (36)

  • Theorem 3.1
  • Theorem 3.2
  • Theorem 3.3
  • Theorem 3.4
  • Theorem 3.5
  • Lemma 4.1
  • Lemma 4.2
  • Theorem 4.3
  • Remark 1
  • Remark 2
  • ...and 26 more