Clustering-Based Weight Orthogonalization for Stabilizing Deep Reinforcement Learning

Guoqing Ma, Yuhan Zhang, Yuming Dai, Guangfu Hao, Yang Chen, Shan Yu

TL;DR

The Clustering Orthogonal Weight Modified (COWM) layer is introduced. It can be integrated into the policy network of any RL algorithm to mitigate non-stationarity effectively. It not only improves learning speed but also reduces gradient interference, thereby enhancing overall learning efficiency.

Abstract

Reinforcement learning (RL) has made significant advancements, achieving superhuman performance in various tasks. However, RL agents often operate under the assumption of environmental stationarity, which poses a great challenge to learning efficiency since many environments are inherently non-stationary. As a result, training can require millions of iterations, leading to low sample efficiency. To address this issue, we introduce the Clustering Orthogonal Weight Modified (COWM) layer, which can be integrated into the policy network of any RL algorithm and mitigates non-stationarity effectively. The COWM layer stabilizes the learning process by employing clustering techniques and a projection matrix. Our approach not only improves learning speed but also reduces gradient interference, thereby enhancing overall learning efficiency. Empirically, COWM outperforms state-of-the-art methods, achieving improvements of 9% and 12.6% on the vision-based and state-based DMControl benchmarks, respectively. It also shows robustness and generality across various algorithms and tasks.

Paper Structure

This paper contains 12 sections, 19 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Non-stationarity in single-task reinforcement learning and its solutions. (a) Multiple sub-policies may exist within a single task. For example, in the walker walk task, the control of the robot can be divided into two sub-policies: learning to stand and learning to walk. (b) Distribution of state features over the course of training. (c) The COWM layer automatically identifies and protects the old policy. The gray arrows represent the input space for each layer of the neural network. The green and blue arrows indicate the input vectors during the learning process for the new and old policies, respectively. The green plane indicates the null space of the old policy. The red arrows show the gradient calculated by stochastic gradient descent for the new policy. The red dashed arrows represent the projection of the gradient into the null space.
  • Figure 2: The COWM layer in the model-based reinforcement learning (MBRL) framework. (a) COWM replaces the linear layer within the actor, while other structures remain unchanged. (b) The internal structure and computational flow of the COWM layer. Computational flow is divided into two main processes: forward propagation (orange) and backward propagation (blue). The green section computes the projection matrix using historical input data. During backward propagation, the projection constrains the gradient using the projection matrix, resulting in the final weight updates.
  • Figure 3: Training curves for 5 tasks in vision-based DMControl, comparing our method with SAC, CURL, and DreamerV3. We replicated the results of SAC, CURL, and DreamerV3 in the same environments on an NVIDIA A40 GPU. Our method is shown in green, DreamerV3 in blue, CURL in red, and SAC in yellow. Mean and 95% CIs over 3 seeds.
  • Figure 4: Performance comparison with existing methods in state-based DMControl. The figure reports rewards for 18 tasks within 250K interactions with the environment. The COWM layers are used without any additional hyperparameter tuning relative to the vision-based COWM, showing the method's generalization ability. PPO is a regularization-based method and D4PG is a rehearsal-based method for mitigating the non-stationarity of DRL.
  • Figure 5: During training with the COWM layer, the representations in the actor network are clustered. There are two cluster centers, meaning the representations are divided into two categories, represented by blue and orange areas. The cluster centers are marked by red crosses.
  • ...and 3 more figures
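
The captions for Figures 1(c), 2(b), and 5 describe clustering historical layer inputs, building a projection matrix from them, and projecting the gradient into the null space of the old policy. The following is a minimal NumPy sketch of that idea, not the authors' implementation. It assumes an OWM-style regularized projector P = I - A(AᵀA + αI)⁻¹Aᵀ whose columns of A are k-means cluster centers. The function names, the plain Lloyd's k-means, the value of `alpha`, and the synthetic data are all illustrative assumptions that do not appear in the source.

```python
import numpy as np


def kmeans_centers(inputs, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a (k, dim) array of cluster centers."""
    rng = np.random.default_rng(seed)
    centers = inputs[rng.choice(len(inputs), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every input to every center: shape (n, k).
        dists = ((inputs[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = inputs[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers


def null_space_projector(centers, alpha=1e-3):
    """Assumed OWM-style projector: P = I - A (A^T A + alpha I)^{-1} A^T.

    The columns of A are the cluster centers; alpha is a small regularizer.
    """
    a = centers.T  # shape (dim, k)
    gram = a.T @ a + alpha * np.eye(a.shape[1])
    return np.eye(a.shape[0]) - a @ np.linalg.solve(gram, a.T)


def project_gradient(grad, projector):
    """Constrain an (out_dim, in_dim) weight gradient to the approximate null space."""
    return grad @ projector


# Synthetic "old policy" inputs forming two well-separated groups (illustrative only).
rng = np.random.default_rng(0)
group_a = rng.normal(scale=0.1, size=(50, 6)) + 3.0 * np.eye(6)[0]
group_b = rng.normal(scale=0.1, size=(50, 6)) + 3.0 * np.eye(6)[1]
old_inputs = np.vstack([group_a, group_b])

centers = kmeans_centers(old_inputs, k=2)
projector = null_space_projector(centers)

# A stand-in gradient for a linear layer y = W x with W of shape (4, 6).
grad = rng.normal(size=(4, 6))
raw_interference = np.abs(grad @ centers.T).max()
projected_interference = np.abs(project_gradient(grad, projector) @ centers.T).max()
print("interference before projection:", raw_interference)
print("interference after projection: ", projected_interference)
```

In this sketch, an update built from `project_gradient(grad, projector)` barely changes the layer's output on the cluster centers, which is the sense in which the old policy is "protected". Directions orthogonal to the centers pass through unchanged, so the layer can still learn from new inputs.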