Weight Clipping for Deep Continual and Reinforcement Learning

Mohamed Elsayed; Qingfeng Lan; Clare Lyle; A. Rupam Mahmood

TL;DR

This work introduces weight clipping as a lightweight, optimizer-agnostic technique to bound weight magnitudes and improve learning under non-stationarity in both streaming supervised and reinforcement learning settings. By clipping weights to $[-\kappa s_l, \kappa s_l]$ based on the initialization range $s_l$, the method enforces Lipschitz continuity and bounded updates, which helps generalization, preserves plasticity in streaming tasks, mitigates policy collapse in PPO, and boosts sample efficiency when replay ratios are large. Empirical results across CIFAR-10 warm-start, input/label permutation streaming tasks, MuJoCo PPO, and Atari DQN/Rainbow demonstrate consistent gains in generalization, stability, and learning efficiency. The findings suggest weight clipping as a practical, low-overhead addition that can complement existing optimization methods without architectural changes, with potential future directions including adaptive clipping and Lipschitz-based guarantees.
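The clipping rule described above can be sketched in a few lines. This is a minimal, framework-free illustration rather than the authors' implementation: it assumes each layer is initialized uniformly in $[-s_l, s_l]$ with $s_l = 1/\sqrt{\text{fan-in}}$ (a common default, assumed here), and the helper names `init_bound`, `init_layer`, `clip_weights`, and `sgd_step_with_clipping` are hypothetical. Plain SGD stands in for the optimizer; the clip is applied after the update, so any other optimizer could be substituted.

```python
import math
import random


def init_bound(fan_in):
    """Assumed initialization range s_l = 1/sqrt(fan_in) (illustrative choice)."""
    return 1.0 / math.sqrt(fan_in)


def init_layer(fan_in, fan_out, rng):
    """Sample a fan_out x fan_in weight matrix uniformly from [-s_l, s_l]."""
    s = init_bound(fan_in)
    return [[rng.uniform(-s, s) for _ in range(fan_in)] for _ in range(fan_out)]


def clip_weights(weights, kappa, s):
    """Clamp every weight to the interval [-kappa * s, kappa * s]."""
    limit = kappa * s
    return [[max(-limit, min(limit, w)) for w in row] for row in weights]


def sgd_step_with_clipping(weights, grads, lr, kappa, s):
    """Take one plain SGD step, then clip the updated weights."""
    updated = [
        [w - lr * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, grads)
    ]
    return clip_weights(updated, kappa, s)


# Usage: one layer with 4 inputs and 2 outputs, hit by very large gradients.
rng = random.Random(0)
s_l = init_bound(4)
W = init_layer(4, 2, rng)
big_grads = [[-100.0] * 4, [100.0] * 4]
W = sgd_step_with_clipping(W, big_grads, lr=0.1, kappa=2.0, s=s_l)
# Every entry of W now lies within [-kappa * s_l, kappa * s_l].
```

Because the clip is a separate step after the parameter update, it needs no change to the optimizer or the architecture. Here $\kappa$ is the only extra hyperparameter: larger values loosen the bound, and smaller values keep the weights closer to their initialization scale.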

Abstract

Many failures in deep continual and reinforcement learning are associated with increasing magnitudes of the weights, making them hard to change and potentially causing overfitting. While many methods address these learning failures, they often change the optimizer or the architecture, a complexity that hinders widespread adoption in various systems. In this paper, we focus on learning failures that are associated with increasing weight norm and we propose a simple technique that can be easily added on top of existing learning systems: clipping neural network weights to limit them to a specific range. We study the effectiveness of weight clipping in a series of supervised and reinforcement learning experiments. Our empirical results highlight the benefits of weight clipping for generalization, addressing loss of plasticity and policy collapse, and facilitating learning with a large replay ratio.

Paper Structure (20 sections, 2 theorems, 6 equations, 15 figures, 2 tables, 1 algorithm)

Introduction
Problem Formulation
Streaming Supervised Learning
Reinforcement Learning
Method
Experiments
Weight Clipping for Improved Generalization
Weight Clipping in Streaming Learning
Weight Clipping Against Policy Collapse
Weight Clipping with Large Replay Ratios
Related Works
Conclusion
Proofs
Proof of Theorem 1
Proof of Corollary 1
...and 5 more sections

Key Result

Theorem 1

Smoothness of Clipped Networks. Consider a fully-connected neural network $f_{\mathcal{W}}: \mathcal{X} \rightarrow \mathcal{Y}$ parametrized by the set of augmented weight matrices (including biases) $\mathcal{W}_{\text{Aug}}=\{{\bm{W}}_1,\dots,{\bm{W}}_L\}$. If the activation function $\sigma$ used

Figures (15)

Figure 1: Weight clipping confines the weights to a restricted region, while L2 Init pulls the current weight ${\bm{w}}_t$ toward the weight at initialization ${\bm{w}}_0$ and L2 pulls the current weight ${\bm{w}}_t$ toward the zero vector.
Figure 2: Performance when training on the data sequentially versus training on the aggregated data. The shaded region represents the standard error.
Figure 3: Performance of Adam and SGD with weight clipping on Input-Permuted MNIST, Label-Permuted EMNIST, and Label-Permuted mini-ImageNet. All curves are averaged over $20$ independent runs. The shaded area represents the standard error.
Figure 4: Diagnostic statistics of different methods on Input-Permuted MNIST. We show the online loss, the online plasticity, the $\ell_2$-norm of gradients, and the $\ell_2$-norm of weights.
Figure 5: Policy collapse in PPO. The performance of PPO with Adam drops when trained for longer, in contrast to Adam+WC, which keeps improving. All curves are averaged over $30$ independent runs. The shaded area represents the standard error.
...and 10 more figures

Theorems & Definitions (4)

Theorem 1
Corollary 1
proof
proof
