ACE: Off-Policy Actor-Critic with Causality-Aware Entropy Regularization

Tianying Ji, Yongyuan Liang, Yan Zeng, Yu Luo, Guowei Xu, Jiawei Guo, Ruijie Zheng, Furong Huang, Fuchun Sun, Huazhe Xu

TL;DR

The proposed algorithm, ACE (Off-policy Actor-critic with Causality-aware Entropy regularization), substantially outperforms model-free RL baselines across 29 diverse continuous control tasks spanning 7 domains, which underscores the effectiveness, versatility, and sample efficiency of the approach.

Abstract

The varying significance of distinct primitive behaviors during the policy learning process has been overlooked by prior model-free RL algorithms. Leveraging this insight, we explore the causal relationship between different action dimensions and rewards to evaluate the significance of various primitive behaviors during training. We introduce a causality-aware entropy term that effectively identifies and prioritizes actions with high potential impacts for efficient exploration. Furthermore, to prevent excessive focus on specific primitive behaviors, we analyze the gradient dormancy phenomenon and introduce a dormancy-guided reset mechanism to further enhance the efficacy of our method. Our proposed algorithm, ACE: Off-policy Actor-critic with Causality-aware Entropy regularization, demonstrates a substantial performance advantage across 29 diverse continuous control tasks spanning 7 domains compared to model-free RL baselines, which underscores the effectiveness, versatility, and sample efficiency of our approach. Benchmark results and videos are available at https://ace-rl.github.io/.

Paper Structure (43 sections, 10 theorems, 10 equations, 32 figures, 2 tables, 1 algorithm)

Key Result

Proposition 3.3

Under the assumptions that the causal graph is Markov and faithful to the observations, there exists an edge from $a_{i,t}$ to $r_t$ if and only if $a_{i,t} \not\!\perp\!\!\!\perp r_t | {\mathbf{s}}_{t}, {\mathbf{a}}_{-i, t}$, where ${\mathbf{a}}_{-i, t}$ are states of ${\mathbf{a}}_t$ except $a_{i,t}$.
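The criterion in Proposition 3.3 can be illustrated with a small sketch. The summary above does not say which causal discovery procedure ACE uses, so the code below swaps in a plain linear partial-correlation check as a stand-in for the conditional independence test. The function names, the synthetic data, and the 0.1 threshold are all hypothetical choices made for illustration, and the check is only meaningful under a roughly linear relationship.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after linearly regressing both on z."""
    design = np.column_stack([np.ones(len(x)), z])  # add an intercept column
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(res_x @ res_y / (np.linalg.norm(res_x) * np.linalg.norm(res_y)))

def action_reward_edges(states, actions, rewards, threshold=0.1):
    """For each action dimension i, flag an edge a_i -> r when a_i and r stay
    correlated after conditioning on the state and the other action dimensions.
    The fixed threshold is an assumption, not a calibrated statistical test."""
    edges = []
    for i in range(actions.shape[1]):
        others = np.delete(actions, i, axis=1)      # a_{-i,t}
        cond = np.column_stack([states, others])    # conditioning set (s_t, a_{-i,t})
        rho = partial_corr(actions[:, i], rewards, cond)
        edges.append(abs(rho) > threshold)
    return edges

# Synthetic example: only action dimension 0 influences the reward.
rng = np.random.default_rng(0)
states = rng.normal(size=(2000, 3))
actions = rng.normal(size=(2000, 3))
rewards = 2.0 * actions[:, 0] + states[:, 0] + 0.1 * rng.normal(size=2000)

edges = action_reward_edges(states, actions, rewards)
print(edges)
```

In this toy setup only the first action dimension should be flagged. A nonlinear or kernel-based independence test would be needed when the reward depends on actions in a nonlinear way.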

Figures (32)

  • Figure 1: (Top): Learning process of a manipulator. A robotic arm learns to manipulate objects in a manner akin to human learning. This arm would be programmed with four primitive behaviors for its end-effector: vertical movements along the z-axis (up and down), horizontal movements along the x-axis (left and right), depth movements along the y-axis (forward and backward), and grasping (applying torque). (Bottom): Comparison of normalized scores. ACE significantly outperforms the widely used model-free RL baselines SAC and TD3 with a single set of hyperparameters.
  • Figure 2: Motivating example. This task involves a robotic arm hammering a screw into a wall. $\bullet$ Initially, the robotic arm approaches the desk moving on the z-axis and struggles with torque grasping, making z-axis positioning $\uparrow$ and torque exploration $\uparrow$ a priority. $\blacktriangle$ As training advances, the agent's focus shifts to optimizing movement, prioritizing end-effector position (x-axis $\uparrow$ and y-axis $\uparrow$). $\bigstar$ Finally, potential improvements lie in stable and swift hammering, shifting focus back to torque $\uparrow$ and placing down the object $\uparrow$. The evolving causal weights, depicted on the left, reflect these changing priorities. See more examples in the Appendix.
  • Figure 3: Dormancy degree curves for SAC, CausalSAC, and ACE in MetaWorld tasks. The curves indicate that the gradient-dormancy-guided reset mechanism effectively reduces gradient dormancy degrees, contributing to the best performance of ACE.
  • Figure 4: Manipulation tasks. Success rate of ACE, SAC, and TD3 on manipulation tasks from the MetaWorld benchmark suite. Solid curves depict the mean of six trials, and shaded regions correspond to one standard deviation. More results are in the Appendix.
  • Figure 5: Locomotion tasks. Average return of ACE, SAC, and TD3 on locomotion tasks provided by the MuJoCo and DMControl benchmark suites. Solid curves depict the mean of six trials, and shaded regions correspond to one standard deviation. See the Appendix for an overall comparison of locomotion tasks.
  • ...and 27 more figures

Theorems & Definitions (17)

  • Proposition 3.3
  • Theorem 3.4
  • Proposition 3.5: Policy evaluation
  • Proposition 3.6: Policy improvement
  • Proposition 3.7: Policy iteration
  • Definition 3.8: Gradient-dormant Neurons
  • Definition 3.9: $\tau$-Dormancy Degree $\alpha_\tau$
  • Proposition 1.3
  • proof
  • Theorem 1.4
  • ...and 7 more