The Role of Deep Learning Regularizations on Actors in Offline RL

Denis Tarasov, Anja Surina, Caglar Gulcehre

TL;DR

It is empirically shown that applying standard regularization techniques to actor networks in offline RL actor-critic algorithms yields improvements of 6% on average across two algorithms and three different continuous D4RL domains.

Abstract

Deep learning regularization techniques, such as dropout, layer normalization, or weight decay, are widely adopted in the construction of modern artificial neural networks, often resulting in more robust training and improved generalization. However, in Reinforcement Learning (RL), these techniques have seen limited use: they are usually applied only to value function estimators (Hiraoka et al., 2021; Smith et al., 2022) and may have detrimental effects. This issue is even more pronounced in offline RL settings, which bear greater similarity to supervised learning but have received less attention. Recent work in continuous offline RL (Park et al., 2024) has demonstrated that while we can build sufficiently powerful critic networks, the generalization of actor networks remains a bottleneck. In this study, we empirically show that applying standard regularization techniques to actor networks in offline RL actor-critic algorithms yields improvements of 6% on average across two algorithms and three different continuous D4RL domains.
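
To make the three regularizers named in the abstract concrete, the following is a minimal, framework-free sketch of an actor forward pass with layer normalization and dropout, plus a decoupled weight-decay step. It is not the authors' implementation: the layer ordering (Linear, LayerNorm, ReLU, Dropout), the tanh output, the network sizes, and all function names and hyperparameter values are assumptions chosen for illustration.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def dropout(x, rate, rng, training):
    """Inverted dropout: zero each unit with probability `rate`, rescale the survivors."""
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def actor_forward(state, params, rng, dropout_rate=0.1, training=True):
    """Assumed layout: Linear -> LayerNorm -> ReLU -> Dropout -> Linear -> tanh."""
    h = linear(state, params["w1"], params["b1"])
    h = layer_norm(h)
    h = [max(0.0, v) for v in h]
    h = dropout(h, dropout_rate, rng, training)
    out = linear(h, params["w2"], params["b2"])
    return [math.tanh(v) for v in out]  # actions squashed into [-1, 1]


def decay_weights(weights, lr, weight_decay):
    """Decoupled weight decay: shrink every weight by a factor (1 - lr * weight_decay)."""
    return [[w * (1.0 - lr * weight_decay) for w in row] for row in weights]


def init_params(state_dim, hidden_dim, action_dim, rng):
    """Hypothetical initializer: uniform weights scaled by fan-in, zero biases."""
    def matrix(rows, cols):
        bound = 1.0 / math.sqrt(cols)
        return [[rng.uniform(-bound, bound) for _ in range(cols)] for _ in range(rows)]

    return {
        "w1": matrix(hidden_dim, state_dim), "b1": [0.0] * hidden_dim,
        "w2": matrix(action_dim, hidden_dim), "b2": [0.0] * action_dim,
    }


rng = random.Random(0)
params = init_params(state_dim=4, hidden_dim=8, action_dim=2, rng=rng)
state = [0.1, -0.3, 0.5, 0.2]

# Evaluation mode: dropout is disabled, so the output is deterministic.
action = actor_forward(state, params, rng, dropout_rate=0.1, training=False)

# Weight decay is applied to the actor's weight matrices, typically alongside the gradient step.
params["w1"] = decay_weights(params["w1"], lr=1e-3, weight_decay=1e-2)
```

In this sketch only the actor is regularized; the critic is left out entirely, which mirrors the abstract's focus on actor networks but says nothing about how the paper configures its critics.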

Paper Structure (34 sections, 7 equations, 20 figures, 8 tables)

Figures (20)

  • Figure 1: rliable metrics for RAR scores averaged over a D4RL subset when regularizations are applied individually. Top row: ReBRAC performance; bottom row: IQL performance. Left graphs: Median, IQM, Mean, and Optimality Gap; right graphs: Probability of Improvement against the original algorithm. Each task was run over 5 random seeds.
  • Figure 2: Median, IQM, Mean, and Optimality Gap rliable metrics for RAR scores averaged over a D4RL subset when regularizations are combined. Top row: ReBRAC performance; bottom row: IQL performance. Each task was run over 5 random seeds.
  • Figure 3: rliable metrics for last-checkpoint performance averaged over all Gym-MuJoCo, AntMaze, and Adroit datasets when regularization hyperparameters are tuned per dataset. Top row: Median, IQM, Mean, and Optimality Gap; bottom row: Probability of Improvement and Performance Profiles. Every task was evaluated over 10 seeds.
  • Figure 4: Final-checkpoint performance with noise injection divided by the noise-free performance. We clip outliers at 1.1 because the metric is sensitive to some Adroit tasks.
  • Figure 5: Validation-data metrics divided by training-data metrics, computed from the actor's penultimate linear layer at the final checkpoint and averaged over a subset of D4RL datasets.
  • ...and 15 more figures
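
The aggregate metrics named in the figure captions (Median, IQM, Mean, Optimality Gap, Probability of Improvement) can be sketched in a few lines. The definitions below follow how these quantities are commonly described for the rliable library, but this is not rliable's API: the function names are hypothetical, the scores are made-up numbers assumed to be normalized so that 1.0 is the target level, and the probability of improvement is shown for a single task only, without confidence intervals.

```python
import statistics


def interquartile_mean(scores):
    """Mean of the middle 50% of scores (bottom and top quarters are discarded)."""
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


def optimality_gap(scores, threshold=1.0):
    """Average shortfall below `threshold`; scores above it contribute zero."""
    return sum(max(threshold - s, 0.0) for s in scores) / len(scores)


def probability_of_improvement(new_scores, base_scores):
    """Fraction of (new, base) run pairs where the new run wins; ties count as half."""
    wins = sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in new_scores
        for y in base_scores
    )
    return wins / (len(new_scores) * len(base_scores))


# Illustrative normalized scores (not results from the paper).
scores = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0, 1.2]

summary = {
    "median": statistics.median(scores),
    "iqm": interquartile_mean(scores),
    "mean": statistics.fmean(scores),
    "optimality_gap": optimality_gap(scores),
}

# Hypothetical per-seed scores for a regularized actor versus its baseline.
poi = probability_of_improvement([0.6, 0.8], [0.5, 0.7])
```

A lower optimality gap is better, while higher is better for the other metrics; a probability of improvement above 0.5 indicates that the new variant tends to outperform the baseline.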