
Loss- and Reward-Weighting for Efficient Distributed Reinforcement Learning

Martin Holen, Per-Arne Andersen, Kristian Muri Knausgård, Morten Goodwin

TL;DR

This work targets efficient gradient aggregation in distributed reinforcement learning by introducing two weighted-merger schemes: Reward-Weighted (R-Weighted) and Loss-Weighted (L-Weighted). Both methods scale each actor’s gradient according to its relative performance signal, using a parameter-server framework with a minimum contribution bound $1/h$. Empirical results across several continuous-control tasks show that L-Weighted delivers the strongest gains (up to $13.84\%$ on average for the cumulative reward) while R-Weighted provides more modest but stable improvements (up to $2.33\%$ on average). The approaches require minimal changes to standard backpropagation and can accelerate convergence and improve final performance in distributed RL environments, especially in continuous-action domains.

Abstract

This paper introduces two learning schemes for distributed agents in Reinforcement Learning (RL) environments: Reward-Weighted (R-Weighted) and Loss-Weighted (L-Weighted) gradient merger. The R/L-Weighted methods replace standard practices for training multiple agents, such as summing or averaging the gradients. The core of our methods is to scale the gradient of each actor based on how high its reward (for R-Weighted) or its loss (for L-Weighted) is compared to the other actors. During training, each agent operates in a differently initialized version of the same environment, so different actors produce different gradients. In essence, the R-Weights and L-Weights of each agent inform the other agents of its potential, which in turn indicates which environment should be prioritized for learning. This approach to distributed learning is possible because environments that yield higher rewards, or lower losses, carry more critical information than environments that yield lower rewards or higher losses. We empirically demonstrate that the R-Weighted methods outperform the state-of-the-art in multiple RL environments.
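
To make the weighted merger concrete, the following is a minimal sketch and not the authors' implementation. It assumes each actor reports a flat gradient and a scalar signal where a higher value earns a larger weight. It also assumes that an actor's weight is its share of the min-shifted signals, floored at the minimum contribution bound $1/h$. The function names, the shifting step, and the equal-signal fallback are all hypothetical; the paper's exact weighting formula may differ. For L-Weighted, how a loss is mapped to such a signal is left to the caller.

```python
from typing import List, Sequence


def merge_weights(signals: Sequence[float], h: float) -> List[float]:
    """Map per-actor signals (higher = more weight) to merge weights.

    Assumption: a weight is the actor's share of the min-shifted signals,
    floored at 1/h. This is an illustrative choice, not the paper's formula.
    """
    floor = 1.0 / h
    lowest = min(signals)
    shifted = [s - lowest for s in signals]
    total = sum(shifted)
    if total == 0.0:
        # All actors report the same signal: fall back to equal weights,
        # still respecting the 1/h floor.
        return [max(1.0 / len(signals), floor)] * len(signals)
    return [max(x / total, floor) for x in shifted]


def weighted_merge(
    gradients: Sequence[Sequence[float]],
    signals: Sequence[float],
    h: float,
) -> List[float]:
    """Combine per-actor gradients into one gradient using the weights above."""
    weights = merge_weights(signals, h)
    merged = [0.0] * len(gradients[0])
    for w, grad in zip(weights, gradients):
        for k, g in enumerate(grad):
            merged[k] += w * g
    return merged


if __name__ == "__main__":
    # Three actors, two parameters each; the rewards act as signals.
    grads = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    rewards = [1.0, 3.0, 6.0]
    print(merge_weights(rewards, h=10))
    print(weighted_merge(grads, rewards, h=10))
```

In this sketch the lowest-scoring actor would get a weight of zero from the share alone, and the $1/h$ floor is what keeps its gradient in the merged update. Replacing `merge_weights` with constant weights of 1 or $1/n$ recovers the summing and averaging baselines.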

Paper Structure

This paper contains 26 sections, 7 equations, 12 figures, 7 tables, 3 algorithms.

Figures (12)

  • Figure 1: System flowchart for Baseline-Sum, Baseline-Avg, R-Weighted, and L-Weighted.
  • Figure 2: Gradient aggregation activity for Baseline-Sum, Baseline-Avg, R-Weighted, and L-Weighted.
  • Figure 3: Average rewards for PPO during training in the CartPole environment.
  • Figure 4: Average rewards for PPO during training in the CartPole environment. The A3C and IMPALA results are normalized by their run time relative to Baseline-Sum: 200 update steps correspond to the time the baseline took, although either algorithm may have performed many more updates within that time.
  • Figure 5: Average rewards for each algorithm using PPO during training in the LunarLander environment.
  • ...and 7 more figures