Bigger, Regularized, Optimistic: scaling for compute and sample-efficient continuous control

Michal Nauman, Mateusz Ostaszewski, Krzysztof Jankowski, Piotr Miłoś, Marek Cygan

TL;DR

The paper investigates whether model scaling can improve sample efficiency in continuous-control RL beyond pure algorithmic tweaks. It introduces BRO, a Bigger, Regularized, Optimistic framework that scales the critic via the BroNet architecture and pairs it with optimistic exploration, regularization, and limited but effective replay strategies. Empirical results across 40 tasks from the DeepMind Control, MetaWorld, and MyoSuite suites show BRO achieving state-of-the-art performance, often with fewer environment steps than leading model-based methods such as TD-MPC2, and attaining near-optimal policies on the challenging Dog and Humanoid tasks. These findings suggest that judicious critic scaling with strong regularization can substantially improve practical RL performance, and they motivate standardized benchmarks for fair comparisons in sample-efficient RL research.

Abstract

Sample efficiency in Reinforcement Learning (RL) has traditionally been driven by algorithmic enhancements. In this work, we demonstrate that scaling can also lead to substantial improvements. We conduct a thorough investigation into the interplay of scaling model capacity and domain-specific RL enhancements. These empirical findings inform the design choices underlying our proposed BRO (Bigger, Regularized, Optimistic) algorithm. The key innovation behind BRO is that strong regularization allows for effective scaling of the critic networks, which, paired with optimistic exploration, leads to superior performance. BRO achieves state-of-the-art results, significantly outperforming the leading model-based and model-free algorithms across 40 complex tasks from the DeepMind Control, MetaWorld, and MyoSuite benchmarks. BRO is the first model-free algorithm to achieve near-optimal policies in the notoriously challenging Dog and Humanoid tasks.

Paper Structure (46 sections, 8 equations, 29 figures, 9 tables, 1 algorithm)

Figures (29)

  • Figure 1: BRO sets a new state of the art, outperforming model-free (MF) and model-based (MB) algorithms on $40$ complex tasks covering $3$ benchmark suites. Y-axes report the interquartile mean calculated over 10 random seeds, with 1.0 representing the best possible performance in a given benchmark. We use $1M$ environment steps.
  • Figure 2: We report sample efficiency (left) and wall-clock time (right) for BRO and BRO (Fast) (BRO with a reduced replay ratio for increased compute efficiency), as well as baseline algorithms, averaged over the $40$ tasks listed in Table \ref{table:environemnts}. BRO achieves the best sample efficiency, whereas BRO (Fast) matches the sample efficiency of model-based TD-MPC2. In terms of wall-clock efficiency, BRO runs approximately 25% faster than TD-MPC2. Remarkably, BRO (Fast) matches the wall-clock efficiency of a standard SAC agent while achieving 400% better performance. The Y-axis reports the interquartile mean, with 1.0 representing the maximal possible performance.
  • Figure 3: DeepMind Control (DMC)
  • Figure 4: MetaWorld (MW)
  • Figure 5: MyoSuite (MS)
  • ...and 24 more figures
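
The captions for Figures 1 and 2 report the interquartile mean (IQM) of normalized scores. As a rough illustration of that statistic, the sketch below computes a 25% trimmed mean in plain Python. The function name `interquartile_mean`, the truncation rule for the tails, and the example scores are assumptions made for illustration; the exact aggregation over seeds and tasks used in the paper is not described in this excerpt.

```python
from typing import Sequence


def interquartile_mean(scores: Sequence[float]) -> float:
    """Mean of the middle 50% of scores, i.e. a 25% trimmed mean.

    Assumption: the number of values dropped from each tail is
    floor(n / 4), which is one common convention for trimmed means.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    ordered = sorted(scores)
    cut = len(ordered) // 4  # values discarded from each tail
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


# Hypothetical normalized scores for one task over 10 seeds (made-up numbers,
# not results from the paper). One seed fails badly.
seed_scores = [0.10, 0.80, 0.85, 0.90, 0.90, 0.95, 0.95, 1.00, 1.00, 1.00]

iqm = interquartile_mean(seed_scores)
plain_mean = sum(seed_scores) / len(seed_scores)
print(f"IQM: {iqm:.3f}, plain mean: {plain_mean:.3f}")
```

With 10 seeds, the two lowest and two highest scores are discarded, so a single failed seed pulls the IQM down far less than it pulls down the plain mean.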