Black box meta-learning intrinsic rewards for sparse-reward environments

Octavio Pappalardo, Rodrigo Ramele, Juan Miguel Santos

TL;DR

This work investigates how meta-learning can improve the training signal received by RL agents under a framework that does not rely on meta-gradients, and compares this approach with the use of extrinsic rewards and with a meta-learned advantage function.

Abstract

Despite the successes and progress of deep reinforcement learning over the last decade, several challenges remain that hinder its broader application. Some fundamental aspects to improve include data efficiency, generalization capability, and ability to learn in sparse-reward environments, which often require human-designed dense rewards. Meta-learning has emerged as a promising approach to address these issues by optimizing components of the learning algorithm to meet desired characteristics. Additionally, a different line of work has extensively studied the use of intrinsic rewards to enhance the exploration capabilities of algorithms. This work investigates how meta-learning can improve the training signal received by RL agents. The focus is on meta-learning intrinsic rewards under a framework that doesn't rely on the use of meta-gradients. We analyze and compare this approach to the use of extrinsic rewards and a meta-learned advantage function. The developed algorithms are evaluated on distributions of continuous control tasks with both parametric and non-parametric variations, and with only sparse rewards accessible for the evaluation tasks.

Paper Structure (18 sections, 1 equation, 4 figures, 1 algorithm)

Figures (4)

  • Figure 1: Comparison of the average performance of agents as they interact with tasks from the test set. The values and their standard deviations (represented by the shaded region) were obtained as explained in section \ref{sec:eval_methodology}. The success rate when training agents using three different types of rewards is shown: intrinsic (red), shaped extrinsic (blue), and sparse extrinsic (green). Three benchmarks are considered: ML1-reach, ML1-close-door, and ML1-button-press. The last episode reflects the performance of the final policy when made deterministic.
  • Figure 2: Success rate of agents trained with different rewards on three meta-learning benchmarks (ML1-reach, ML1-close-door, and ML1-button-press) after an adaptation period of 4000 steps. The figure compares the performance when using three different types of rewards: intrinsic (red), shaped extrinsic (blue), and sparse extrinsic (green). The values were obtained as explained in section \ref{sec:eval_methodology}.
  • Figure 3: Comparison of the average performance of meta-learning methods as they interact with a new task from the training set (top row) and the test set (bottom row). Success rates are shown for two methods that meta-learned different parameterizations of the loss: using intrinsic rewards (red) and using advantages (blue). Three benchmarks are considered: ML1-reach, ML1-button-press, and ML10. The last episode reflects the performance of the final policy when made deterministic.
  • Figure 4: Performance comparison of different methods that meta-learn policy parameters as they interact with tasks from the training set (top row) and the test set (bottom row). The values and their standard deviations (represented by the shaded region) were obtained as explained in section \ref{sec:eval_methodology}. The success rates for $RL^2$ (green) and MAML (blue) on the benchmarks ML1-reach, ML1-button-press, and ML10 are shown.