Weight Clipping for Deep Continual and Reinforcement Learning

Mohamed Elsayed, Qingfeng Lan, Clare Lyle, A. Rupam Mahmood

TL;DR

This work introduces weight clipping as a lightweight, optimizer-agnostic technique to bound weight magnitudes and improve learning under non-stationarity in both streaming supervised and reinforcement learning settings. By clipping weights to $[-\kappa s_l, \kappa s_l]$ based on the initialization range $s_l$, the method enforces Lipschitz continuity and bounded updates, which helps generalization, preserves plasticity in streaming tasks, mitigates policy collapse in PPO, and boosts sample efficiency when replay ratios are large. Empirical results across CIFAR-10 warm-start, input/label permutation streaming tasks, MuJoCo PPO, and Atari DQN/Rainbow demonstrate consistent gains in generalization, stability, and learning efficiency. The findings suggest weight clipping as a practical, low-overhead addition that can complement existing optimization methods without architectural changes, with potential future directions including adaptive clipping and Lipschitz-based guarantees.
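
As a rough illustration of the clipping rule described above, the following pure-Python sketch clips every weight of a layer to $[-\kappa s_l, \kappa s_l]$ right after an ordinary gradient step. It assumes that $s_l = 1/\sqrt{\text{fan-in}}$ (the bound of a uniform initialization), that plain SGD is the base optimizer, and that the helper names (`init_bound`, `init_layer`, `clip_layer`, `sgd_step_with_clipping`) are hypothetical. None of these details are taken from the paper's code.

```python
import math
import random


def init_bound(fan_in):
    """Assumed initialization range s_l: uniform init on [-s_l, s_l] with s_l = 1/sqrt(fan_in)."""
    return 1.0 / math.sqrt(fan_in)


def init_layer(fan_in, fan_out, rng):
    """Sample a (fan_out x fan_in) weight matrix uniformly from [-s_l, s_l]."""
    s_l = init_bound(fan_in)
    return [[rng.uniform(-s_l, s_l) for _ in range(fan_in)] for _ in range(fan_out)]


def clip_layer(weights, kappa, s_l):
    """Clip each weight element-wise to [-kappa * s_l, kappa * s_l]."""
    bound = kappa * s_l
    return [[max(-bound, min(bound, w)) for w in row] for row in weights]


def sgd_step_with_clipping(weights, grads, lr, kappa, s_l):
    """One plain SGD update followed by weight clipping (SGD is an illustrative choice)."""
    updated = [
        [w - lr * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, grads)
    ]
    return clip_layer(updated, kappa, s_l)


if __name__ == "__main__":
    rng = random.Random(0)
    s_l = init_bound(4)
    layer = init_layer(fan_in=4, fan_out=2, rng=rng)
    # Deliberately large gradients so that the unclipped update would leave the box.
    grads = [[-100.0] * 4, [100.0] * 4]
    clipped = sgd_step_with_clipping(layer, grads, lr=0.1, kappa=2.0, s_l=s_l)
    print(max(abs(w) for row in clipped for w in row) <= 2.0 * s_l)
```

Because the clipping is applied after the update, the same wrapper could in principle follow any optimizer step, which is the sense in which the technique is described as optimizer-agnostic.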

Abstract

Many failures in deep continual and reinforcement learning are associated with increasing magnitudes of the weights, making them hard to change and potentially causing overfitting. While many methods address these learning failures, they often change the optimizer or the architecture, a complexity that hinders widespread adoption in various systems. In this paper, we focus on learning failures that are associated with increasing weight norm and we propose a simple technique that can be easily added on top of existing learning systems: clipping neural network weights to limit them to a specific range. We study the effectiveness of weight clipping in a series of supervised and reinforcement learning experiments. Our empirical results highlight the benefits of weight clipping for generalization, addressing loss of plasticity and policy collapse, and facilitating learning with a large replay ratio.

Paper Structure

This paper contains 20 sections, 2 theorems, 6 equations, 15 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

Smoothness of Clipped Networks. Consider a fully-connected neural network $f_{\mathcal{W}}: \mathcal{X} \rightarrow \mathcal{Y}$ parametrized by the set of augmented weight matrices (including biases) $\mathcal{W}_{\text{Aug}}=\{{\bm{W}}_1,\dots,{\bm{W}}_L \}$. If the activation function $\sigma$ used

Figures (15)

  • Figure 1: Weight clipping confines the weights to a restricted region, whereas L2 Init pulls the current weight ${\bm{w}}_t$ toward the weight at initialization ${\bm{w}}_0$ and L2 pulls the current weight ${\bm{w}}_t$ toward the zero vector.
  • Figure 2: Performance when training on the data sequentially versus when the data is aggregated. The shaded region represents the standard error.
  • Figure 3: Performance of Adam and SGD with Weight Clipping on Input-permuted MNIST, Label-Permuted EMNIST, and Label-Permuted mini-ImageNet. All curves are averaged over $20$ independent runs. The shaded area represents the standard error.
  • Figure 4: Diagnostic statistics of different methods on Input-permuted MNIST. We show the online loss, the online plasticity, the $\ell_2$-norm of gradients, and the $\ell_2$-norm of weights.
  • Figure 5: Policy Collapse in PPO. The performance of PPO with Adam drops when it is trained for longer, whereas PPO with Adam+WC keeps improving. All curves are averaged over $30$ independent runs. The shaded area represents the standard error.
  • ...and 10 more figures
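
The contrast drawn in the Figure 1 caption can be sketched on a single weight vector. The snippet below is only an illustration under stated assumptions: the L2 penalty is taken to be $\tfrac{\lambda}{2}\lVert{\bm{w}}\rVert^2$ and the L2 Init penalty $\tfrac{\lambda}{2}\lVert{\bm{w}}-{\bm{w}}_0\rVert^2$, each applied as one gradient step on the penalty alone, and the function names and numeric values are hypothetical rather than taken from the paper.

```python
def l2_step(w, lr, lam):
    """Gradient step on (lam/2)*||w||^2: shrinks every weight toward zero."""
    return [wi - lr * lam * wi for wi in w]


def l2_init_step(w, w0, lr, lam):
    """Gradient step on (lam/2)*||w - w0||^2: pulls every weight toward its initial value."""
    return [wi - lr * lam * (wi - w0i) for wi, w0i in zip(w, w0)]


def clip_step(w, bound):
    """Weight clipping: weights inside [-bound, bound] are untouched, the rest are projected."""
    return [max(-bound, min(bound, wi)) for wi in w]


if __name__ == "__main__":
    w0 = [0.5, -0.25]  # hypothetical initial weights
    w = [2.0, 0.25]    # hypothetical current weights
    print(l2_step(w, lr=1.0, lam=0.5))
    print(l2_init_step(w, w0, lr=1.0, lam=0.5))
    print(clip_step(w, bound=1.0))
```

In this sketch the two penalties move every coordinate, including the one that is already small, while clipping changes only the coordinate that lies outside the box.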

Theorems & Definitions (4)

  • Theorem 1
  • Corollary 1
  • proof
  • proof