Hyperspherical Normalization for Scalable Deep Reinforcement Learning
Hojoon Lee, Youngdo Lee, Takuma Seno, Donghu Kim, Peter Stone, Jaegul Choo
TL;DR
This work tackles the challenge of scaling deep reinforcement learning, where non-stationarity and unstable norms hinder the performance of large models. It introduces SimbaV2, an architecture that constrains weights and features to the unit hypersphere through hyperspherical normalization and pairs this with a distributional critic and reward scaling to keep gradients stable across varying reward magnitudes. Built from a shift-aware hyperspherical input embedding, a residual-on-hypersphere encoder with learnable interpolation, and a KL-based distributional critic, SimbaV2 achieves state-of-the-art results on 57 continuous-control tasks and scales effectively with model size and compute. The approach performs robustly in both online and offline RL while reducing the need for periodic reinitialization and extensive hyperparameter tuning, which makes it a promising direction for scalable RL in real-world applications.
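The hyperspherical components named above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the constant `c_shift`, and the per-dimension interpolation vector `alpha` are assumptions made for illustration. The sketch assumes that "shift-aware" means appending a constant coordinate before projecting onto the unit hypersphere, and that the residual connection is a learnable linear interpolation followed by re-projection.

```python
import math


def l2_normalize(x, eps=1e-8):
    """Project a vector onto the unit hypersphere (unit L2 norm)."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / (norm + eps) for v in x]


def shift_embed(obs, c_shift=3.0):
    """Append a constant coordinate, then normalize.

    Assumption: without the extra coordinate, obs and 2 * obs would map
    to the same point on the sphere; c_shift (hypothetical value) keeps
    magnitude information recoverable after projection.
    """
    return l2_normalize(list(obs) + [c_shift])


def lerp_residual(h, h_new, alpha):
    """Interpolate between block input h and block output h_new
    with per-dimension weights alpha, then re-project onto the sphere."""
    mixed = [(1.0 - a) * u + a * v for u, v, a in zip(h, h_new, alpha)]
    return l2_normalize(mixed)


def norm(x):
    return math.sqrt(sum(v * v for v in x))


# Plain normalization discards scale: [1, 0] and [2, 0] coincide.
plain_a = l2_normalize([1.0, 0.0])
plain_b = l2_normalize([2.0, 0.0])

# The shifted embedding keeps the two inputs distinct.
shift_a = shift_embed([1.0, 0.0])
shift_b = shift_embed([2.0, 0.0])

# A residual step stays on the unit hypersphere.
h = shift_embed([0.5, -2.0, 1.0])
h_new = l2_normalize([1.0, 0.0, 0.0, 0.0])
out = lerp_residual(h, h_new, alpha=[0.25] * 4)
```

In this sketch every intermediate feature has unit norm by construction, which is the property the summary credits with keeping feature norms from growing as the model is scaled. In a trained network `alpha` would be a learned parameter rather than a fixed list.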
Abstract
Scaling up model size and computation has brought consistent performance improvements in supervised learning. However, this lesson often fails to apply to reinforcement learning (RL) because training the model on non-stationary data easily leads to overfitting and unstable optimization. In response, we introduce SimbaV2, a novel RL architecture designed to stabilize optimization by (i) constraining the growth of weight and feature norms through hyperspherical normalization; and (ii) using distributional value estimation with reward scaling to maintain stable gradients under varying reward magnitudes. Using soft actor-critic as the base algorithm, SimbaV2 scales up effectively with larger models and greater compute, achieving state-of-the-art performance on 57 continuous control tasks across 4 domains. The code is available at https://dojeon-ai.github.io/SimbaV2.
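To make point (ii) concrete, the following is a simplified sketch of how reward scaling and a categorical value target can fit together. It is an illustration under stated assumptions, not the paper's method: `return_scale` stands in for some running estimate of return magnitude, the two-hot encoding replaces a full distributional projection, and the support bounds and bin count are hypothetical. The loss is a cross-entropy, which differs from the KL divergence only by the entropy of the target and therefore has the same gradient with respect to the logits.

```python
import math


def scale_reward(r, return_scale, eps=1e-8):
    """Divide the reward by an estimate of return magnitude
    (return_scale is a hypothetical running statistic)."""
    return r / (return_scale + eps)


def two_hot(target, v_min, v_max, n_bins):
    """Encode a scalar target as probabilities over evenly spaced atoms.

    The mass is split between the two atoms that bracket the target,
    so the expectation of the distribution equals the clipped target.
    """
    target = min(max(target, v_min), v_max)
    step = (v_max - v_min) / (n_bins - 1)
    pos = (target - v_min) / step
    lo = int(math.floor(pos))
    hi = min(lo + 1, n_bins - 1)
    w_hi = pos - lo
    probs = [0.0] * n_bins
    probs[lo] += 1.0 - w_hi
    probs[hi] += w_hi
    return probs


def cross_entropy(target_probs, logits):
    """Cross-entropy between target probabilities and softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -sum(p * (l - log_z) for p, l in zip(target_probs, logits))


# A raw reward of 50 with a return scale of 100 lands well inside
# the fixed support [-5, 5].
v_min, v_max, n_bins = -5.0, 5.0, 11
atoms = [v_min + i * (v_max - v_min) / (n_bins - 1) for i in range(n_bins)]
probs = two_hot(scale_reward(50.0, 100.0), v_min, v_max, n_bins)
expected_value = sum(p * a for p, a in zip(probs, atoms))
loss = cross_entropy(probs, [0.0] * n_bins)  # uniform prediction
```

The design intuition in this sketch is that the categorical loss has bounded gradients with respect to the logits, and scaling rewards keeps targets inside a fixed support, so the same critic configuration can be reused across tasks whose raw rewards differ by orders of magnitude.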
