SimBa: Simplicity Bias for Scaling Up Parameters in Deep Reinforcement Learning

Hojoon Lee, Dongyoon Hwang, Donghu Kim, Hyunseung Kim, Jun Jet Tai, Kaushik Subramanian, Peter R. Wurman, Jaegul Choo, Peter Stone, Takuma Seno

TL;DR

The paper addresses the challenge of scaling neural networks in deep reinforcement learning without sacrificing generalization. It introduces SimBa, an architecture that embeds a simplicity bias through RSNorm-based observation normalization, a pre-layer normalization residual feedforward block, and post-layer normalization, enabling large parameter counts while stabilizing learning. Empirically, SimBa improves sample efficiency across off-policy, on-policy, and unsupervised RL, and SAC with SimBa matches or surpasses state-of-the-art baselines on 51 tasks with favorable compute. The findings suggest architecture-driven simplicity bias as a practical, scalable path to more capable RL agents, with broad applicability and straightforward implementation.

Abstract

Recent advances in CV and NLP have been largely driven by scaling up the number of network parameters, despite traditional theories suggesting that larger networks are prone to overfitting. These large networks avoid overfitting by integrating components that induce a simplicity bias, guiding models toward simple and generalizable solutions. However, in deep RL, designing and scaling up networks has been less explored. Motivated by this opportunity, we present SimBa, an architecture designed to scale up parameters in deep RL by injecting a simplicity bias. SimBa consists of three components: (i) an observation normalization layer that standardizes inputs with running statistics, (ii) a residual feedforward block that provides a linear pathway from the input to the output, and (iii) a layer normalization that controls feature magnitudes. Scaling up parameters with SimBa consistently improves the sample efficiency of various deep RL algorithms, including off-policy, on-policy, and unsupervised methods. Moreover, solely by integrating the SimBa architecture into SAC, the resulting agent matches or surpasses state-of-the-art deep RL methods with high computational efficiency across DMC, MyoSuite, and HumanoidBench. These results demonstrate SimBa's broad applicability and effectiveness across diverse RL algorithms and environments.
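
The three components named in the abstract can be sketched as a small NumPy forward pass. This is a minimal illustration, not the authors' implementation. The class names `RunningStatsNorm` and `SimBaSketch`, the 4x hidden expansion, the ReLU activation, the random initialization, and the omission of biases and learnable LayerNorm parameters are all assumptions made to keep the example short.

```python
import numpy as np


class RunningStatsNorm:
    """Standardizes observations with running mean/variance (hypothetical helper)."""

    def __init__(self, dim, eps=1e-8):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.count = 0
        self.eps = eps

    def update(self, batch):
        # Merge batch statistics into the running statistics
        # using the parallel (pairwise) mean/variance update.
        batch = np.atleast_2d(batch)
        n = batch.shape[0]
        b_mean, b_var = batch.mean(axis=0), batch.var(axis=0)
        total = self.count + n
        delta = b_mean - self.mean
        m2 = self.var * self.count + b_var * n + delta**2 * self.count * n / total
        self.mean = self.mean + delta * n / total
        self.var = m2 / total
        self.count = total

    def __call__(self, obs):
        return (obs - self.mean) / np.sqrt(self.var + self.eps)


def layer_norm(x, eps=1e-5):
    # LayerNorm over the feature axis, without learnable scale/shift (assumption).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


class SimBaSketch:
    """Observation norm -> linear embed -> pre-LN residual blocks -> post-LN -> linear head."""

    def __init__(self, obs_dim, hidden_dim, out_dim, num_blocks=2, expand=4, seed=0):
        rng = np.random.default_rng(seed)

        def init(fan_in, fan_out):
            return rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_out))

        self.rsnorm = RunningStatsNorm(obs_dim)
        self.w_in = init(obs_dim, hidden_dim)
        self.blocks = [
            (init(hidden_dim, expand * hidden_dim), init(expand * hidden_dim, hidden_dim))
            for _ in range(num_blocks)
        ]
        self.w_out = init(hidden_dim, out_dim)

    def __call__(self, obs):
        x = self.rsnorm(obs) @ self.w_in
        for w1, w2 in self.blocks:
            # Pre-LN feedforward branch; the skip connection keeps a linear
            # pathway from the block input to its output.
            h = np.maximum(layer_norm(x) @ w1, 0.0)
            x = x + h @ w2
        # Post-layer normalization controls feature magnitudes before the head.
        return layer_norm(x) @ self.w_out


# Usage: update the running statistics with a batch, then run a forward pass.
obs = np.random.default_rng(1).normal(5.0, 3.0, size=(32, 17))
net = SimBaSketch(obs_dim=17, hidden_dim=64, out_dim=6)
net.rsnorm.update(obs)
out = net(obs)
print(out.shape)  # (32, 6)
```

In an actual agent the running statistics would typically be updated as new observations arrive and the weights would be trained by the RL algorithm (for example SAC). The sketch only shows how the pieces compose.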

Paper Structure

This paper contains 44 sections, 18 equations, 25 figures, 15 tables.

Figures (25)

  • Figure 1: Benchmark Summary. (a) Sample Efficiency: SimBa improves sample efficiency across various RL algorithms, including off-policy (SAC, TD-MPC2), on-policy (PPO), and unsupervised RL (METRA). (b) Compute Efficiency: When applying SimBa with SAC, it matches or surpasses state-of-the-art off-policy RL methods across 51 continuous control tasks, by only modifying the network architecture and scaling up the number of network parameters.
  • Figure 2: (a) SimBa exhibits higher simplicity bias than MLP. (b) SAC with SimBa improves as the number of parameters increases, whereas SAC with MLP degrades. Variability is reported as 95% CI.
  • Figure 3: SimBa architecture. The network integrates Running Statistics Normalization (RSNorm), Residual Feedforward Blocks, and Post-Layer Normalization to embed simplicity bias into deep RL.
  • Figure 4: Component Analysis. (a) Simplicity bias scores estimated via Fourier analysis. Mean and 95% CI are computed over 100 random initializations. (b) Average return in DMC-Hard for 1M steps. Mean and 95% CI over 10 seeds, using SAC. Stronger simplicity bias correlates with higher returns for overparameterized networks.
  • Figure 5: Architecture Comparison. (a) SimBa consistently exhibits a higher simplicity bias score. (b) SimBa demonstrates strong scaling performance in terms of average return for DMC-Hard compared to the other architectures. The results are from 5 random seeds.
  • ...and 20 more figures