Value Improved Actor Critic Algorithms
Yaniv Oren, Moritz A. Zanger, Pascal R. van der Vaart, Mustafa Mert Celikok, Matthijs T. J. Spaan, Wendelin Böhmer
TL;DR
The authors address the tradeoff between greedification and stability in gradient-based actor-critic methods by decoupling the acting policy from the policy evaluated by the critic, an approach they call Value-Improved Actor Critic (VIAC). They give a formal convergence analysis for a broad class of greedification operators under generalized policy iteration in finite-horizon MDPs, and show that policy improvement alone is not sufficient for convergence with stochastic policies. The framework supports both explicit and implicit value-improvement mechanisms, including expectile-based implicit greedification. The resulting variants (e.g., VI-TD3, VI-SAC) improve on or match their baselines across DeepMind continuous control tasks at negligible compute and implementation cost. The work also connects VIAC to existing methods (GMZ, IQL and similar offline methods, MPO variants) and offers practical guidance on operator design. Overall, VIAC provides a unified theory and a practical toolkit for augmenting actor-critic algorithms with value improvement to increase learning speed while preserving stability.
Abstract
To learn approximately optimal acting policies for decision problems, modern actor-critic algorithms rely on deep neural networks (DNNs) to parameterize the acting policy and on greedification operators to iteratively improve it. The reliance on DNNs suggests a gradient-based improvement, which is much less greedy per step than what greedier operators can achieve, such as the greedy update used by Q-learning algorithms. On the other hand, slow changes to the policy can also benefit the stability of the learning process, resulting in a tradeoff between greedification and stability. To better address this tradeoff, we propose to decouple the acting policy from the policy evaluated by the critic. This allows the agent to separately improve the critic's policy (e.g., value improvement) with greedier updates while maintaining the slow gradient-based improvement to the parameterized acting policy. We investigate the convergence of this approach using the popular analysis scheme of generalized policy iteration in the finite-horizon domain. Empirically, incorporating value improvement into the popular off-policy actor-critic algorithms TD3 and SAC significantly improves or matches performance over their respective baselines across different environments from the DeepMind continuous control domain, with negligible compute and implementation cost.
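The decoupling described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the critic `q_value`, the actor `acting_policy`, and the sample-and-maximize greedification step are all hypothetical stand-ins, chosen only to show how a bootstrap target can evaluate a greedier policy than the one that acts.

```python
import numpy as np

def q_value(state, action):
    # Hypothetical toy critic (assumption): the best action is 0.5 * state.
    return -(action - 0.5 * state) ** 2

def acting_policy(state):
    # Hypothetical actor (assumption): improves slowly, currently suboptimal.
    return 0.2 * state

def baseline_target(reward, next_state, gamma=0.99):
    # Standard actor-critic target: the critic evaluates the acting policy.
    next_action = acting_policy(next_state)
    return reward + gamma * q_value(next_state, next_action)

def value_improved_target(reward, next_state, gamma=0.99,
                          n_samples=16, noise=0.3, rng=None):
    # Illustrative greedier target: the critic evaluates an improved policy,
    # here obtained by maximizing Q over perturbations of the actor's action.
    rng = np.random.default_rng(0) if rng is None else rng
    next_action = acting_policy(next_state)
    candidates = next_action + noise * rng.standard_normal(n_samples)
    # Keep the actor's own action among the candidates, so this target is
    # never lower than the baseline target.
    candidates = np.append(candidates, next_action)
    return reward + gamma * np.max(q_value(next_state, candidates))

if __name__ == "__main__":
    print("baseline target:      ", baseline_target(1.0, 2.0))
    print("value-improved target:", value_improved_target(1.0, 2.0))
```

In this sketch the acting policy is left untouched; only the target used to train the critic becomes greedier, which is the separation the abstract describes. An implicit operator such as an expectile loss would replace the explicit maximization step.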
