Actor-Critic or Critic-Actor? A Tale of Two Time Scales

Shalabh Bhatnagar, Vivek S. Borkar, Soumyajit Guin

TL;DR

This paper introduces the critic-actor (CA) algorithm, which reverses the conventional two-time-scale updates of tabular actor-critic (AC): the value function is updated on the slower time scale and the policy on the faster one. The authors prove convergence of CA using two-time-scale stochastic approximation and ODE techniques, and show that under the reversed time scales CA emulates value iteration, whereas AC emulates policy iteration. Empirically, CA performs on par with standard actor-critic in accuracy and computational effort across tabular and function-approximation settings, the latter with both linear and neural-network architectures. The work thus adds a theoretically grounded alternative to actor-critic to the RL algorithmic landscape.

Abstract

We revisit the standard formulation of the tabular actor-critic algorithm as a two time-scale stochastic approximation, with the value function computed on a faster time-scale and the policy computed on a slower time-scale. This emulates policy iteration. We observe that reversing the time scales in fact emulates value iteration and yields a legitimate algorithm. We provide a proof of convergence and compare the two algorithms empirically, with and without function approximation (using both linear and nonlinear function approximators), and observe that our proposed critic-actor algorithm performs on par with actor-critic in terms of both accuracy and computational effort.
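
To make the time-scale reversal concrete, the following is a minimal illustrative sketch, not the authors' exact update rules. It assumes a discounted-reward tabular MDP, a softmax policy, a TD(0) critic, and polynomial step sizes `1/n**fast_exp` and `1/n**slow_exp`. The function name `run_two_timescale`, the exponents, and the random MDP are all hypothetical choices made for illustration. Setting `critic_fast=True` gives the usual actor-critic ordering, and `critic_fast=False` swaps the two step sizes to give a critic-actor ordering.

```python
import numpy as np


def run_two_timescale(P, R, gamma=0.9, critic_fast=True, n_steps=5000,
                      fast_exp=0.55, slow_exp=0.95, seed=0):
    """Tabular two-time-scale sketch (illustrative assumptions throughout).

    P: transition probabilities, shape (n_states, n_actions, n_states)
    R: one-step rewards, shape (n_states, n_actions)
    critic_fast=True  -> actor-critic (value on the faster time scale)
    critic_fast=False -> critic-actor (policy on the faster time scale)
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)                   # critic: value estimates
    theta = np.zeros((n_states, n_actions))  # actor: softmax preferences
    s = 0
    for n in range(1, n_steps + 1):
        # slow / fast -> 0 as n grows, which separates the two time scales.
        fast = 1.0 / n ** fast_exp
        slow = 1.0 / n ** slow_exp
        critic_lr, actor_lr = (fast, slow) if critic_fast else (slow, fast)

        # Softmax policy at the current state (shifted for numerical stability).
        prefs = theta[s] - theta[s].max()
        pi = np.exp(prefs)
        pi /= pi.sum()

        a = rng.choice(n_actions, p=pi)
        s_next = rng.choice(n_states, p=P[s, a])

        # Temporal-difference error drives both updates.
        delta = R[s, a] + gamma * V[s_next] - V[s]

        # Critic update.
        V[s] += critic_lr * delta

        # Actor update: gradient of log softmax is onehot(a) - pi.
        grad_log_pi = -pi
        grad_log_pi[a] += 1.0
        theta[s] += actor_lr * delta * grad_log_pi

        s = s_next
    return V, theta


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    P = rng.dirichlet(np.ones(5), size=(5, 3))  # random MDP, shape (5, 3, 5)
    R = rng.uniform(size=(5, 3))
    V_ac, _ = run_two_timescale(P, R, critic_fast=True)
    V_ca, _ = run_two_timescale(P, R, critic_fast=False)
    print("actor-critic values:", np.round(V_ac, 3))
    print("critic-actor values:", np.round(V_ca, 3))
```

The only difference between the two runs is which update receives the larger step size, which mirrors the abstract's point that the same pair of recursions can emulate either policy iteration or value iteration depending on the ordering of the time scales.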

Paper Structure (8 sections, 5 theorems, 40 equations, 11 figures, 1 table)

This paper contains 8 sections, 5 theorems, 40 equations, 11 figures, 1 table.

Key Result

Lemma 1

The sequences converge almost surely as $m\rightarrow\infty$.

Figures (11)

  • Figure 1: $|S|=1000,|U|=6,\alpha_1=1,\beta_1=0.55,\alpha_2=1,\beta_2=0.55$
  • Figure 2: $|S|=1000,|U|=6,\alpha_1=0.95,\beta_1=0.75,\alpha_2=0.75,\beta_2=0.55$
  • Figure 3: $|S|=1000,|U|=6,\alpha_1=0.75,\beta_1=0.55,\alpha_2=0.95,\beta_2=0.75$
  • Figure 4: $|S|=400,|U|=4,\alpha_1=0.95,\beta_1=0.75,\alpha_2=0.75,\beta_2=0.55$
  • Figure 5: $|S|=10000,|U|=8,\alpha_1=0.75,\beta_1=0.55,\alpha_2=0.95,\beta_2=0.75$
  • ...and 6 more figures

Theorems & Definitions (10)

  • Lemma 1
  • proof
  • Proposition 2
  • proof
  • Theorem 3
  • proof
  • Lemma 4
  • proof
  • Theorem 5
  • Remark 6