Boosting the Actor with Dual Critic

Bo Dai, Albert Shaw, Niao He, Lihong Li, Le Song

TL;DR

The paper reframes policy optimization as a two-player game between an actor and a dual critic by deriving a Lagrangian dual form of the Bellman optimality equation. It introduces Dual-AC, a multi-step, path-regularized, stochastic dual ascent algorithm that updates the value function, dual weights, and policy in a coordinated way to optimize a common objective. The approach addresses instability in function-approximation settings, demonstrates local duality via path regularization, and achieves state-of-the-art or competitive results on MuJoCo continuous-control benchmarks. This framework provides a unified, theoretically grounded pathway for stable, efficient actor-critic learning with principled off-policy data utilization.

Abstract

This paper proposes a new actor-critic-style algorithm called Dual Actor-Critic, or Dual-AC. It is derived in a principled way from the Lagrangian dual form of the Bellman optimality equation, which can be viewed as a two-player game between the actor and a critic-like function that we call the dual critic. Compared to its actor-critic relatives, Dual-AC has the desirable property that the actor and dual critic are updated cooperatively to optimize the same objective function, which provides a more transparent way of learning a critic that is directly tied to the actor's objective. We then provide a concrete algorithm that effectively solves the resulting minimax optimization problem, using multi-step bootstrapping, path regularization, and stochastic dual ascent. We demonstrate that the proposed algorithm achieves state-of-the-art performance across several benchmarks.
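
For orientation, the two-player game mentioned in the abstract can be sketched from the standard linear-programming view of the Bellman optimality equation. The symbols below are assumptions that this summary does not define: $\mu$ is an initial state distribution, $\gamma \in (0, 1)$ a discount factor, $R(s, a)$ the reward, and $p(s'|s, a)$ the transition kernel. The exact normalization used in the paper may differ.

$$\min_{V} \; (1-\gamma)\,\mathbb{E}_{s\sim\mu}\left[V(s)\right] \quad \text{s.t.} \quad V(s) \ge R(s, a) + \gamma\,\mathbb{E}_{s'|s, a}\left[V(s')\right], \;\; \forall (s, a)\in\mathcal{S}\times\mathcal{A}$$

$$\min_{V}\,\max_{\rho\ge 0}\; L(V, \rho) = (1-\gamma)\,\mathbb{E}_{s\sim\mu}\left[V(s)\right] + \sum_{(s, a)\in\mathcal{S}\times\mathcal{A}} \rho(s, a)\left(R(s, a) + \gamma\,\mathbb{E}_{s'|s, a}\left[V(s')\right] - V(s)\right)$$

Under this reading, $V$ plays the role of the dual critic and the multiplier $\rho$ encodes the actor, since a policy can be recovered from $\rho$ by normalizing over actions.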

Paper Structure

This paper contains 27 sections, 9 theorems, 58 equations, 2 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

$\sum_{(s, a)\in\mathcal{S}\times\mathcal{A}}\rho^*(s, a) = 1$, and $\pi^*(a|s) = \frac{\rho^*(s, a)}{\sum_{a'\in\mathcal{A}} \rho^*(s, a')}$.
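
The identity in Theorem 1 can be checked numerically on a toy problem. The sketch below is illustrative only and is not the paper's algorithm: it builds the normalized discounted state-action occupancy of an arbitrary fixed policy (not the optimal $\rho^*$) on a hypothetical two-state, two-action MDP, then verifies that the occupancy sums to one and that normalizing over actions returns the policy. The MDP, the names `occupancy` and `policy_from_rho`, and the $(1-\gamma)$ normalization are all assumptions made for this example.

```python
def occupancy(P, pi, mu, gamma, iters=2000):
    """Normalized discounted state-action occupancy of a fixed policy.

    P[s][a][s2] : transition probabilities (hypothetical toy MDP)
    pi[s][a]    : policy probabilities
    mu[s]       : initial state distribution
    Returns rho[s][a] = (1 - gamma) * sum_t gamma^t * Pr(s_t = s, a_t = a),
    truncated after `iters` steps.
    """
    n_s, n_a = len(pi), len(pi[0])
    d = list(mu)                      # state distribution at step t
    rho = [[0.0] * n_a for _ in range(n_s)]
    weight = 1.0 - gamma              # (1 - gamma) * gamma^t
    for _ in range(iters):
        for s in range(n_s):
            for a in range(n_a):
                rho[s][a] += weight * d[s] * pi[s][a]
        # Push the state distribution one step forward under pi and P.
        d = [
            sum(d[s] * pi[s][a] * P[s][a][s2]
                for s in range(n_s) for a in range(n_a))
            for s2 in range(n_s)
        ]
        weight *= gamma
    return rho


def policy_from_rho(rho):
    """Recover pi(a|s) = rho(s, a) / sum_a' rho(s, a')."""
    policy = []
    for row in rho:
        z = sum(row)
        policy.append([x / z for x in row])
    return policy


# Hypothetical toy MDP: 2 states, 2 actions.
P = [
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0 under actions 0, 1
    [[0.5, 0.5], [0.0, 1.0]],   # transitions from state 1 under actions 0, 1
]
pi = [[0.7, 0.3], [0.4, 0.6]]
mu = [0.5, 0.5]

rho = occupancy(P, pi, mu, gamma=0.9)
total_mass = sum(sum(row) for row in rho)
recovered = policy_from_rho(rho)
```

Here `total_mass` should be numerically close to 1 and `recovered` should match `pi` up to floating-point error, provided every state has nonzero visitation mass (guaranteed here because `mu` puts weight on both states).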

Figures (2)

  • Figure 1: Comparison between Dual-AC and its variants, supporting the analysis of the sources of instability.
  • Figure 2: Results of Dual-AC against the TRPO and PPO baselines. Each plot shows the average reward during training across $5$ runs with different random seeds, with a $50\%$ confidence interval. The x-axis is the number of training iterations. Dual-AC performs comparably to TRPO and PPO on some tasks and outperforms them on the more challenging tasks.

Theorems & Definitions (9)

  • Theorem 1: Policy from dual variables
  • Theorem 2: Competition in one-step setting
  • Theorem 3: Competition in multi-step setting
  • Theorem 4: Property of path regularization
  • Theorem 5
  • Theorem 6
  • Lemma 7
  • Theorem 8
  • Lemma 9