Curriculum Reinforcement Learning for Complex Reward Functions

Kilian Freitag, Kristian Ceder, Rita Laezza, Knut Åkesson, Morteza Haghir Chehreghani

TL;DR

A two-stage reward curriculum that first maximizes a simple reward function and then transitions to the full, complex reward, together with a method that automatically determines the transition point between the two stages based on how well an actor fits a critic.

Abstract

Reinforcement learning (RL) has emerged as a powerful tool for tackling control problems, but its practical application is often hindered by the complexity arising from intricate reward functions with multiple terms. The reward hypothesis posits that any objective can be encapsulated in a scalar reward function, yet balancing individual, potentially adversarial, reward terms without exploitation remains challenging. To overcome the limitations of traditional RL methods, which often require precise balancing of competing reward terms, we propose a two-stage reward curriculum that first maximizes a simple reward function and then transitions to the full, complex reward. We provide a method based on how well an actor fits a critic to automatically determine the transition point between the two stages. Additionally, we introduce a flexible replay buffer that enables efficient phase transfer by reusing samples from one stage in the next. We evaluate our method on the DeepMind control suite, modified to include an additional constraint term in the reward definitions. We further evaluate our method in a mobile robot scenario with even more competing reward terms. In both settings, our two-stage reward curriculum achieves a substantial improvement in performance compared to a baseline trained without curriculum. Instead of exploiting the constraint term in the reward, it is able to learn policies that balance task completion and constraint satisfaction. Our results demonstrate the potential of two-stage reward curricula for efficient and stable RL in environments with complex rewards, paving the way for more robust and adaptable robotic systems in real-world applications.
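To make the two-stage idea and the sample reuse between stages concrete, the following is a minimal sketch and not the authors' implementation. It assumes the full reward has the form $r = r_b + w_c \cdot r_c$ (base reward plus weighted constraint term, using the symbols from the figure captions) and that the buffer stores both terms separately so that old samples can be relabeled after the phase switch. The names `Transition`, `curriculum_reward`, and `FlexibleReplayBuffer` are hypothetical.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Transition:
    state: tuple
    action: tuple
    base_reward: float        # r_b: task reward without constraints
    constraint_reward: float  # r_c: constraint term (assumed separate from r_b)
    next_state: tuple
    done: bool


def curriculum_reward(base_reward, constraint_reward, phase, w_c):
    """Phase 1 uses only r_b; phase 2 uses the ASSUMED full reward r_b + w_c * r_c."""
    if phase == 1:
        return base_reward
    return base_reward + w_c * constraint_reward


@dataclass
class FlexibleReplayBuffer:
    """Hypothetical buffer: keeps reward terms apart and relabels them at sample time."""
    capacity: int
    w_c: float
    phase: int = 1
    storage: list = field(default_factory=list)
    _next: int = 0

    def add(self, transition):
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self._next] = transition  # overwrite the oldest entry
        self._next = (self._next + 1) % self.capacity

    def sample(self, batch_size, rng=random):
        batch = rng.sample(self.storage, min(batch_size, len(self.storage)))
        # The reward is computed for the current phase, so samples collected
        # in phase 1 remain usable after switching to phase 2.
        return [
            (t.state, t.action,
             curriculum_reward(t.base_reward, t.constraint_reward, self.phase, self.w_c),
             t.next_state, t.done)
            for t in batch
        ]


buffer = FlexibleReplayBuffer(capacity=1000, w_c=0.5)
buffer.add(Transition((0.0,), (1.0,), 1.0, -0.4, (0.1,), False))
reward_phase_1 = buffer.sample(1)[0][2]
buffer.phase = 2  # in the paper this switch is triggered automatically
reward_phase_2 = buffer.sample(1)[0][2]
```

Under these assumptions, the same stored transition yields the base reward in the first stage and the weighted sum in the second, which is one way samples could carry over between stages without being collected again.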

Paper Structure

This paper contains 18 sections, 21 equations, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: Comparison of the standard TD3 algorithm with its reward curriculum version (RC-TD3). Fig. a) shows the normalized mean episode reward, visualized using constraint weight $w_c=1.0$ so that policies trained with different constraint weights $w_c$ can be compared directly. As can be seen, the curriculum becomes more effective with higher $w_c$. It seems to be most helpful in environments where the constraints are not automatically optimized by completing the task, i.e. the ones in the center of the left plot with $w_c=0.0$. In environments where learning is unsuccessful in the first place, or that are relatively simple even with constraints, the effects of the curriculum are less pronounced. Fig. b) shows the mean base reward $r_b$ without constraints. Importantly, employing a reward curriculum maintains or improves the base reward in almost all cases, especially for high $w_c$. This demonstrates its effectiveness in finding a better trade-off between task performance and constraint satisfaction, given that the baseline often gets stuck in the local minimum of only optimizing constraints, as in the case of finger spin.
  • Figure 2: Functions for dense reward terms with $\kappa=0.942$, $v_{\text{ref}}=1.2$ and $d_{\text{track,max}}=5$. The range for each term is normalized to $[-1, 1]$. Green shows the reward shaping term that enables finding the goal faster. The soft constraints are colored in red.
  • Figure 3: Exemplary environment maps used for training. Obstacle positions, paths, initial states, and goal positions are randomized. While maps 0 and 2 contain dynamic obstacles, maps 1 and 3 only contain static ones.
  • Figure 4: Sensitivity of the median episode reward of RC-TD3 to resetting the replay buffer and resetting the networks when changing curriculum phases. The values are smoothed by taking the running average with window size 50 [k]. As can be seen, resetting the networks degrades performance, while resetting the replay buffer has relatively little influence on the final outcome. We conclude that the main benefit of a reward curriculum in these environments comes from the improved exploration provided by a "pretrained" network. The gray dashed line indicates the mean time at which the curriculum phase was switched in RC-TD3.
  • Figure 5: Comparison of the median episode reward of RC-TD3 with the automatic curriculum switch described in Section \ref{sec:autoswitch} to switching at the fixed times $T/8$, $T/3$, and $T/2$, where $T$ is the total number of timesteps. The values are smoothed by taking the running average with window size 50 [k], and the dashed lines indicate the time at which each curriculum switched phases. The results indicate that for some environments, such as cartpole swingup, the exact switching time does not seem to matter, while the automatic switch manages to pick a more beneficial time in the case of walker run. Importantly, all reward curriculum versions work better than not employing a curriculum, and our method, RC-TD3, yields one of the best results in all cases.
  • ...and 3 more figures
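
The captions of Figures 4 and 5 mention two simple computations: smoothing a reward curve with a running average, and switching the curriculum at the fixed times T/8, T/3, and T/2. The sketch below illustrates both under stated assumptions: the window is given in number of logged points (the unit behind "50 [k]" is not specified here), only fully covered windows are returned, and the fixed switch times are rounded down to whole timesteps. The function names are hypothetical.

```python
def running_average(values, window):
    """Mean over a sliding window; returns len(values) - window + 1 points."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    total = sum(values[:window])
    smoothed = [total / window]
    for i in range(window, len(values)):
        # Slide the window by one: add the new value, drop the oldest one.
        total += values[i] - values[i - window]
        smoothed.append(total / window)
    return smoothed


def fixed_switch_steps(total_steps):
    """Fixed switch times T/8, T/3, T/2 (rounding down is an assumption)."""
    return [total_steps // 8, total_steps // 3, total_steps // 2]


curve = running_average([1, 2, 3, 4, 5], window=3)
switches = fixed_switch_steps(1_200_000)
```

A fixed schedule like this serves only as a comparison point; the automatic switch in RC-TD3 chooses the transition time from the training signal instead.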