Actor-Critic or Critic-Actor? A Tale of Two Time Scales
Shalabh Bhatnagar, Vivek S. Borkar, Soumyajit Guin
TL;DR
This paper introduces the critic-actor (CA) algorithm by reversing the time scales of the conventional two-time-scale tabular actor-critic: the value function is updated on the slower time scale and the policy on the faster one. It proves convergence of CA using two-time-scale stochastic approximation and ODE techniques, and shows that CA emulates value iteration, whereas standard actor-critic emulates policy iteration. Empirically, CA performs on par with standard actor-critic in accuracy and computational effort, both in the tabular setting and with function approximation (linear and neural-network architectures). The work broadens the RL algorithmic landscape by providing a theoretically sound alternative to actor-critic with potential benefits in convergence behavior and practicality for large-scale problems.
Abstract
We revisit the standard formulation of the tabular actor-critic algorithm as a two-time-scale stochastic approximation, with the value function computed on the faster time scale and the policy computed on the slower time scale. This emulates policy iteration. We observe that reversing the time scales instead emulates value iteration and yields a legitimate algorithm. We provide a proof of convergence and compare the two algorithms empirically, with and without function approximation (using both linear and nonlinear function approximators). We observe that our proposed critic-actor algorithm performs on par with actor-critic in terms of both accuracy and computational effort.
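To make the time-scale reversal concrete, here is a minimal tabular sketch in Python. It is an illustration under stated assumptions, not the authors' exact algorithm: the softmax policy parameterization, the reward-maximization convention, the step-size exponents, the single global iteration counter, and the names `step_sizes` and `run` are all hypothetical choices made for this example. The only difference between the two schemes in the sketch is which update receives the faster-decaying (slower time scale) step size.

```python
import numpy as np

def step_sizes(n, critic_actor=False):
    """Return (value_step, policy_step) at iteration n >= 1.

    `slow` decays faster than `fast`, so slow / fast -> 0 as n grows,
    which is the usual two-time-scale separation condition.
    Exponents 0.6 and 1.0 are assumptions for this sketch.
    """
    fast = 1.0 / n ** 0.6
    slow = 1.0 / n
    # Actor-critic: value on the fast scale, policy on the slow scale.
    # Critic-actor: the two roles are swapped.
    return (slow, fast) if critic_actor else (fast, slow)

def run(P, R, gamma=0.9, n_iters=20000, critic_actor=False, seed=0):
    """Run a tabular two-time-scale scheme on one trajectory.

    P: transition probabilities, shape (S, A, S).
    R: expected rewards, shape (S, A) (rewards are maximized here).
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    V = np.zeros(n_states)                   # value estimates (critic)
    theta = np.zeros((n_states, n_actions))  # policy preferences (actor)
    s = 0
    for n in range(1, n_iters + 1):
        # Softmax policy at the current state (shifted for stability).
        prefs = theta[s] - theta[s].max()
        pi = np.exp(prefs) / np.exp(prefs).sum()
        a = rng.choice(n_actions, p=pi)
        s_next = rng.choice(n_states, p=P[s, a])

        # Temporal-difference error drives both updates.
        td = R[s, a] + gamma * V[s_next] - V[s]
        v_step, p_step = step_sizes(n, critic_actor)

        V[s] += v_step * td
        grad_log_pi = -pi                    # gradient of log pi(a|s) w.r.t. theta[s]
        grad_log_pi[a] += 1.0
        theta[s] += p_step * td * grad_log_pi
        s = s_next
    return V, theta
```

Calling `run(P, R, critic_actor=False)` gives the conventional ordering, and `critic_actor=True` gives the reversed one, so both can be compared on the same small MDP. The sketch makes no claim about which performs better.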
