Value Improved Actor Critic Algorithms

Yaniv Oren, Moritz A. Zanger, Pascal R. van der Vaart, Mustafa Mert Celikok, Matthijs T. J. Spaan, Wendelin Bohmer

TL;DR

The authors tackle the tradeoff between greedification and stability in gradient-based actor-critic methods by decoupling the acting policy from the critic-evaluated policy, introducing Value-Improved Actor Critic (VIAC). They provide a formal convergence analysis for a broad class of greedification operators under generalized policy iteration in finite-horizon MDPs and show that policy improvement alone is not sufficient for convergence with stochastic policies. The framework supports both explicit and implicit value-improvement mechanisms, including expectile-based implicit greedification, and their VIAC variants (e.g., VI-TD3, VI-SAC) demonstrate consistent, compute-efficient improvements across DeepMind control tasks. The work connects VIAC to existing methods (GMZ, IQL/IQL-like offline methods, MPO variants) and provides practical guidance on operator design, with empirical results indicating robust improvements and negligible overhead. Overall, VIAC offers a unified theory and practical toolkit for augmenting actor-critic algorithms with value-driven improvements to enhance learning speed and stability.

Abstract

To learn approximately optimal acting policies for decision problems, modern Actor Critic algorithms rely on deep Neural Networks (DNNs) to parameterize the acting policy and greedification operators to iteratively improve it. The reliance on DNNs suggests an improvement that is gradient based, which is per step much less greedy than the improvement possible by greedier operators such as the greedy update used by Q-learning algorithms. On the other hand, slow changes to the policy can also be beneficial for the stability of the learning process, resulting in a tradeoff between greedification and stability. To better address this tradeoff, we propose to decouple the acting policy from the policy evaluated by the critic. This allows the agent to separately improve the critic's policy (e.g. value improvement) with greedier updates while maintaining the slow gradient-based improvement to the parameterized acting policy. We investigate the convergence of this approach using the popular analysis scheme of generalized Policy Iteration in the finite-horizon domain. Empirically, incorporating value-improvement into the popular off-policy actor-critic algorithms TD3 and SAC significantly improves or matches performance over their respective baselines, across different environments from the DeepMind continuous control domain, with negligible compute and implementation cost.

Paper Structure (55 sections, 13 theorems, 45 equations, 8 figures, 1 table, 4 algorithms)

Key Result

Theorem 1

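Let $\pi$ and $\pi'$ be two policies such that $\forall s \in \mathcal{S}$:

$$\mathbb{E}_{a \sim \pi'(\cdot|s)}\left[q^{\pi}(s,a)\right] \geq \mathbb{E}_{a \sim \pi(\cdot|s)}\left[q^{\pi}(s,a)\right] = v^{\pi}(s) \qquad \text{(eq:greedification)}$$

Then $\forall s \in \mathcal{S}$:

$$v^{\pi'}(s) \geq v^{\pi}(s) \qquad \text{(eq:improvement)}$$

In addition, if there is strict inequality of Equation eq:greedification at any state, then there must be strict inequality of Equation eq:improvement at at least one state.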
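
The policy improvement statement can be checked numerically on a toy problem. The sketch below is only an illustration and is not taken from the paper: it assumes a hypothetical 2-state, 2-action MDP with made-up transition probabilities and rewards, and it uses the discounted infinite-horizon setting for simplicity (the paper analyses the finite-horizon case). It evaluates a uniform policy $\pi$, builds a greedy policy $\pi'$ with respect to $q^{\pi}$, and checks that $\mathbb{E}_{a \sim \pi'}[q^{\pi}(s,a)] \geq v^{\pi}(s)$ and $v^{\pi'}(s) \geq v^{\pi}(s)$ hold at every state.

```python
# Hypothetical 2-state, 2-action discounted MDP (all numbers are illustrative).
GAMMA = 0.9
N_STATES, N_ACTIONS = 2, 2
# P[s][a] lists next-state probabilities; R[s][a] is the expected reward.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.0, 1.0]]]
R = [[0.0, 1.0],
     [2.0, 0.5]]


def evaluate(policy, sweeps=2000):
    """Iterative policy evaluation; policy[s][a] = pi(a|s). Returns (v, q)."""
    v = [0.0] * N_STATES
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(sweeps):
        # One Bellman backup: q from the current v, then v as the policy average of q.
        q = [[R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in range(N_STATES))
              for a in range(N_ACTIONS)] for s in range(N_STATES)]
        v = [sum(policy[s][a] * q[s][a] for a in range(N_ACTIONS))
             for s in range(N_STATES)]
    return v, q


def greedy(q):
    """Deterministic policy that puts all mass on argmax_a q[s][a]."""
    policy = []
    for s in range(N_STATES):
        best = max(range(N_ACTIONS), key=lambda a: q[s][a])
        policy.append([1.0 if a == best else 0.0 for a in range(N_ACTIONS)])
    return policy


pi = [[0.5, 0.5], [0.5, 0.5]]      # uniform starting policy
v_pi, q_pi = evaluate(pi)
pi_new = greedy(q_pi)              # greedier policy with respect to q^pi
v_new, _ = evaluate(pi_new)

# Greedification condition: E_{a~pi'}[q^pi(s,a)] >= v^pi(s) at every state.
greedified = all(
    sum(pi_new[s][a] * q_pi[s][a] for a in range(N_ACTIONS)) >= v_pi[s] - 1e-9
    for s in range(N_STATES))
# Improvement conclusion: v^{pi'}(s) >= v^pi(s) at every state.
improved = all(v_new[s] >= v_pi[s] - 1e-9 for s in range(N_STATES))
print(greedified, improved)
```

A deterministic greedy policy is only one way to satisfy the greedification condition; any $\pi'$ that shifts probability mass toward higher-valued actions under $q^{\pi}$ would pass the same check.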

Figures (8)

  • Figure 1: Mean and one standard error (shaded area) across 10 seeds for VI-TD3 with $\mathcal{I}_2$ set to the deterministic policy gradient and an increasing number of gradient steps (pg=n), with baseline TD3 (pg=0) for reference.
  • Figure 2: Mean and one standard error across 10 seeds for VI-TD3 with expectile loss with different values of the expectile parameter $\tau$. Performance increases up to $\tau=0.8$ and then decays.
  • Figure 3: Baseline TD3 (dashed) and SAC (solid) in red vs. VI-TD3/SAC (blue) with expectile loss and BoN (dot-dashed). Mean and two standard errors ($\approx 95\%$ Gaussian CI) in the shaded area of evaluation curves across 10 seeds for BoN and 20 for the other agents.
  • Figure 4: Mean and one standard error across 10 seeds. Left: Final evaluation vs. greedification parameter $\tau$ for VI-TD3 with implicit improvement after $3m$ environment interactions. $\tau=0.5$ is baseline TD3. Right: Final overestimation bias vs. $\tau$ after $3m$ environment interactions. Most of the performance increase is independent of any increase in overestimation bias.
  • Figure 5: Mean and two standard errors across 10 seeds of VI-TD7 with expectile loss vs. TD7 on the same tasks as Figure 3. Similar performance gains are observed for VI-TD7 in this domain.
  • ...and 3 more figures
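
Several of the captions above vary the expectile parameter $\tau$. As a rough illustration of what that parameter controls, the sketch below implements the standard asymmetric squared (expectile) loss $|\tau - \mathbb{1}[u < 0]|\,u^2$ and a small solver for the $\tau$-expectile of a sample. The function names, the sign convention for the residual $u$ (sample minus estimate), and the sample values are assumptions made for this example; the paper's exact loss and its use inside the critic update may differ.

```python
def expectile_loss(residuals, tau):
    """Mean asymmetric squared loss |tau - 1[u < 0]| * u**2 over residuals u."""
    total = 0.0
    for u in residuals:
        # Positive residuals are weighted by tau, negative ones by (1 - tau).
        weight = (1.0 - tau) if u < 0 else tau
        total += weight * u * u
    return total / len(residuals)


def expectile(samples, tau, iters=200):
    """tau-expectile of samples, found by iteratively reweighted averaging.

    The minimizer m of expectile_loss([x - m for x in samples], tau) satisfies
    m = sum(w_i * x_i) / sum(w_i), with w_i = tau if x_i > m else (1 - tau).
    """
    m = sum(samples) / len(samples)  # start from the mean (the 0.5-expectile)
    for _ in range(iters):
        weights = [tau if x > m else (1.0 - tau) for x in samples]
        m = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
    return m


# Hypothetical bootstrapped targets for a single state-action pair.
targets = [0.0, 1.0, 2.0, 3.0, 10.0]
for tau in (0.5, 0.8, 0.99):
    print(tau, expectile(targets, tau))
```

With $\tau = 0.5$ the loss is half the mean squared error and the expectile is the ordinary mean, which is consistent with the caption's remark that $\tau = 0.5$ recovers baseline TD3. As $\tau$ approaches 1 the expectile moves toward the maximum of the samples, which is the sense in which a larger $\tau$ gives a greedier value estimate.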

Theorems & Definitions (31)

  • Definition 1: Policy Improvement Operator
  • Theorem 1: Policy Improvement
  • Definition 2: Greedification Operator
  • Theorem 2: Improvement is not enough
  • Definition 3: Necessary Greedification
  • Definition 4: Lower Bounded Greedification
  • Definition 5: Limit-Sufficient Greedification
  • Theorem 3: Convergence of Algorithms \ref{alg:api} and \ref{alg:vi_api}
  • Corollary 1
  • Corollary 2
  • ...and 21 more