Hyperspherical Normalization for Scalable Deep Reinforcement Learning

Hojoon Lee, Youngdo Lee, Takuma Seno, Donghu Kim, Peter Stone, Jaegul Choo

TL;DR

This work addresses two obstacles to scaling deep reinforcement learning: non-stationary training data and unstable weight and feature norms, both of which degrade large-model performance. It introduces SimbaV2, an architecture that constrains weight and feature norms with hyperspherical normalization, which in turn keeps gradient norms stable, and that pairs a distributional critic with reward scaling so gradients stay stable across varying reward magnitudes. Its main components are a shift-aware hyperspherical input embedding, a residual connection on the hypersphere implemented as a learnable interpolation (LERP), and a KL-based distributional critic. With soft actor-critic as the base algorithm, SimbaV2 achieves state-of-the-art results on 57 continuous-control tasks across 4 domains and scales effectively with model size and compute. It performs competitively in both online and offline RL using fixed hyperparameters across all domains, with reduced need for periodic reinitialization.

Abstract

Scaling up the model size and computation has brought consistent performance improvements in supervised learning. However, this lesson often fails to apply to reinforcement learning (RL) because training the model on non-stationary data easily leads to overfitting and unstable optimization. In response, we introduce SimbaV2, a novel RL architecture designed to stabilize optimization by (i) constraining the growth of weight and feature norm by hyperspherical normalization; and (ii) using a distributional value estimation with reward scaling to maintain stable gradients under varying reward magnitudes. Using the soft actor-critic as a base algorithm, SimbaV2 scales up effectively with larger models and greater compute, achieving state-of-the-art performance on 57 continuous control tasks across 4 domains. The code is available at https://dojeon-ai.github.io/SimbaV2.
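
As a rough illustration of point (i), the sketch below takes one gradient step on a weight matrix and then projects each row back onto the unit hypersphere, so the weight norm cannot grow during training. This is a minimal sketch, not the authors' implementation: the names `project_rows` and `sgd_step_on_sphere`, the choice of normalizing per row, the plain SGD update, and the learning rate are all assumptions made for illustration.

```python
import math

def project_rows(W, eps=1e-8):
    """Rescale every row of W to (approximately) unit L2 norm."""
    projected = []
    for row in W:
        norm = math.sqrt(sum(w * w for w in row))
        projected.append([w / (norm + eps) for w in row])
    return projected

def sgd_step_on_sphere(W, grad, lr=0.1):
    """Hypothetical update: plain SGD step, then re-project onto the sphere.

    Assumption: normalization is applied per row; the actual axis and
    optimizer used by SimbaV2 may differ.
    """
    stepped = [
        [w - lr * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(W, grad)
    ]
    return project_rows(stepped)

# Usage: row norms are unit after the update, whatever the gradient scale.
W = project_rows([[3.0, 4.0], [0.0, 2.0]])
grad = [[10.0, -5.0], [0.5, 0.0]]
W_next = sgd_step_on_sphere(W, grad)
row_norms = [math.sqrt(sum(w * w for w in row)) for row in W_next]
```

Because every row is rescaled after the step, a large gradient changes only the direction of a weight vector, never its length.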

Paper Structure

This paper contains 56 sections, 41 equations, 21 figures, 35 tables.

Figures (21)

  • Figure 1: Compute vs RL Performance. Performance scales with increased compute when using Soft Actor-Critic with SimbaV2 architecture, outperforming other state-of-the-art RL algorithms. SimbaV2 achieves 0.848 normalized return with an update-to-data (UTD) ratio of 1, surpassing TD-MPC2 (0.749 at UTD=1), Simba (0.818 at UTD=8), and BRO (0.807 at UTD=8). Grey numbers below each point indicate the UTD ratio. Results are averaged over 57 continuous control tasks from MuJoCo, DMC, MyoSuite, and HumanoidBench, each trained on 1 million samples.
  • Figure 2: Benchmark Summary. (a) SimbaV2, with an update-to-data (UTD) ratio of 2, outperforms state-of-the-art RL algorithms across diverse continuous control benchmarks using fixed hyperparameters across all domains. (b) SimbaV2 delivers competitive performance in both online and offline RL while requiring significantly less training computation and offering faster inference times.
  • Figure 3: SimbaV2 architecture. The input observation is first normalized using running statistics, then shifted along a new axis with a constant $c_\text{shift}$ to preserve magnitude information before being projected onto the unit hypersphere. The projected observation is passed through a linear layer, followed by a series of non-linear blocks and refined with LERP, serving as a residual connection. A final linear layer predicts the policy or value function.
  • Figure 4: SimbaV2 vs. Simba Training Dynamics. We track 5 metrics during training to understand the learning dynamics of SimbaV2: (a) Average normalized return across tasks. (b) Weighted sum of $\ell_2$-norms of all intermediate features in the critic. (c) Weighted sum of $\ell_2$-norms of all critic parameters. (d) Weighted sum of $\ell_2$-norms of all gradients in the critic. (e) Effective learning rate (ELR) of the critic. On both environments, SimbaV2 maintains stable norms and ELR, while Simba exhibits divergent fluctuations.
  • Figure 5: Width Scaling. We scale the number of model parameters by increasing the width of the critic network. On DMC-Hard, both Simba and SimbaV2 benefit from increased model size. On HBench-Hard, however, Simba plateaus at larger model sizes, whereas SimbaV2 continues to improve.
  • ...and 16 more figures
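
The input embedding and LERP residual described in the Figure 3 caption can be sketched as follows. This is a hedged illustration rather than the paper's code: the function names, the value `c_shift=3.0`, and the fixed interpolation vector `alpha` (standing in for a learnable parameter) are assumptions, and the running-statistics normalization, linear layers, and non-linear blocks are omitted.

```python
import math

def l2_normalize(x, eps=1e-8):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / (norm + eps) for v in x]

def shift_embed(obs, c_shift=3.0):
    """Append a constant coordinate, then project onto the unit sphere.

    Without the extra coordinate, obs and 2 * obs would map to the same
    point; with it, the magnitude of obs still affects the result.
    c_shift=3.0 is a placeholder value, not taken from the source.
    """
    return l2_normalize(list(obs) + [c_shift])

def lerp_residual(h, h_block, alpha):
    """Interpolate between the input h and the block output, then re-project.

    alpha is a fixed per-dimension list here; in the described architecture
    the interpolation is learnable.
    """
    mixed = [(1.0 - a) * u + a * v for u, v, a in zip(h, h_block, alpha)]
    return l2_normalize(mixed)

# Usage: two observations with the same direction but different magnitudes.
plain_small = l2_normalize([1.0, 0.0])
plain_large = l2_normalize([2.0, 0.0])
shifted_small = shift_embed([1.0, 0.0])
shifted_large = shift_embed([2.0, 0.0])
h_next = lerp_residual([1.0, 0.0], [0.0, 1.0], alpha=[0.5, 0.5])
```

In this sketch, `plain_small` and `plain_large` coincide, while `shifted_small` and `shifted_large` remain distinct points on the sphere, which is the sense in which the shift preserves magnitude information.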

Theorems & Definitions (2)

  • Definition 1.1: Retraction
  • Definition 7.1: Effective Learning Rate