Overestimation, Overfitting, and Plasticity in Actor-Critic: the Bitter Lesson of Reinforcement Learning

Michal Nauman, Michał Bortkiewicz, Piotr Miłoś, Tomasz Trzciński, Mateusz Ostaszewski, Marek Cygan

TL;DR

This work implemented over 60 different off-policy agents, each integrating established regularization techniques from recent state-of-the-art algorithms, revealing that while the effectiveness of a specific regularization setup varies with the task, certain combinations consistently demonstrate robust and superior performance.

Abstract

Recent advancements in off-policy Reinforcement Learning (RL) have significantly improved sample efficiency, primarily due to the incorporation of various forms of regularization that enable more gradient update steps than traditional agents. However, many of these techniques have been tested in limited settings, often on tasks from single simulation benchmarks and against well-known algorithms rather than a range of regularization approaches. This limits our understanding of the specific mechanisms driving RL improvements. To address this, we implemented over 60 different off-policy agents, each integrating established regularization techniques from recent state-of-the-art algorithms. We tested these agents across 14 diverse tasks from 2 simulation benchmarks, measuring training metrics related to overestimation, overfitting, and plasticity loss -- issues that motivate the examined regularization techniques. Our findings reveal that while the effectiveness of a specific regularization setup varies with the task, certain combinations consistently demonstrate robust and superior performance. Notably, a simple Soft Actor-Critic agent, appropriately regularized, reliably finds a better-performing policy within the training regime, which previously was achieved mainly through model-based approaches.

Paper Structure (39 sections, 3 equations, 26 figures, 2 tables)

Figures (26)

  • Figure 1: IQM performance under first-order marginalization. The left column shows baseline SAC augmented with a single regularization technique (10 seeds per task); the right column shows the aggregate performance of each regularization technique when paired with other regularizations (640 seeds per task). Rows: MW (top), DMC without Dog environments (middle), and Dog-Run and Dog-Trot only (bottom). 14 tasks in total.
  • Figure 2: Second-order results marginalizing critic regularization methods. On the x-axis, we have different types of plasticity regularization, and each colour denotes network regularization. For better readability, points within one plasticity regularization are spaced slightly horizontally. Vertical lines indicate standard error.
  • Figure 3: Mean return over 4 million timesteps for the Dog-Run (top row) and Dog-Trot (bottom row) environments. The gray curve shows the performance of the model-based agent. Each plot shows the top three combinations.
  • Figure 4: IQM performance of the top six intervention pair combinations in the 1-million-step experiments. The IQM is computed from the average of the last ten evaluation points in each run, not from the last evaluation point alone. Top row: Dog-Run. Bottom row: Dog-Trot.
  • Figure 5: Correlations of explanatory metrics for three groups of environments: MetaWorld, DMC Dog environments, and DMC environments without Dog environments. Note that not only does the main explanatory metric, gradient norm, differ for the Dog environments, but the remaining DMC environments also show a correlation of a different sign for this metric.
  • ...and 21 more figures
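
The Figure 4 caption describes scoring each run by the average of its last ten evaluation points and then reporting the IQM (interquartile mean). A minimal Python sketch of that calculation follows. The function names are hypothetical, and the trimming rule (drop the lowest and highest floor(25%) of scores) is one common convention assumed here; it is not taken from the paper's code, which may also aggregate across tasks differently.

```python
from statistics import mean


def final_score(eval_returns, last_k=10):
    """Score one run as the mean of its last `last_k` evaluation points.

    If the run has fewer than `last_k` points, all of them are used.
    """
    if not eval_returns:
        raise ValueError("run has no evaluation points")
    return mean(eval_returns[-last_k:])


def interquartile_mean(scores):
    """Mean of the middle 50% of scores.

    Assumption: floor(25% of n) scores are dropped from each end.
    """
    ordered = sorted(scores)
    n = len(ordered)
    if n == 0:
        raise ValueError("no scores given")
    cut = int(n * 0.25)
    return mean(ordered[cut:n - cut])


def iqm_of_runs(runs, last_k=10):
    """IQM over per-run scores, one score per run (for example, per seed)."""
    return interquartile_mean([final_score(run, last_k) for run in runs])
```

Averaging the last ten evaluation points smooths out noise in any single evaluation, and trimming the outer quartiles keeps one unusually good or bad seed from dominating the aggregate.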