Decoupled Weight Decay for Any $p$ Norm
Nadav Joseph Outmezguine, Noam Levi
TL;DR
This work addresses the resource-intensive training of large neural networks by introducing a decoupled weight decay scheme for $L_p$ regularization, enabling highly sparse models while preserving generalization. It derives a bi-convex reformulation with auxiliary variables and presents the proximal-gradient-based $p$-norm Weight Decay ($p$WD) update $w \leftarrow (w - \alpha \nabla \mathcal{L})/(1 + \alpha \lambda_p |w|^{p-2})$, with auxiliary variable $s = |w|^{p-2}$, which integrates smoothly with adaptive optimizers. Empirically, $p$WD reaches sparsity of up to ~99.5% on CIFAR-10 and Tiny Shakespeare with accuracy rivaling AdamW. Sparsity is strongest for $p<1$, while generalization peaks around $1<p<2$. The study also discusses limitations and extensions such as $s$-dynamics, $p$-scheduling, and elastic-net hybrids. Overall, $p$WD provides a practical, low-overhead path to sparse training, with performance comparable to state-of-the-art pruning methods, enabling energy- and memory-efficient deployment and potentially applying to broader optimization contexts.
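The update quoted above can be illustrated with a minimal NumPy sketch. This is only a literal transcription of the TL;DR formula for plain gradient descent, not the authors' implementation: the function name `pwd_step`, its argument names, and the small floor `eps` (added here so that $|w|^{p-2}$ stays finite at $w = 0$) are assumptions made for illustration.

```python
import numpy as np

def pwd_step(w, grad, lr, lam, p, eps=1e-12):
    """One step of w <- (w - lr * grad) / (1 + lr * lam * |w|^(p-2)).

    `eps` is an illustrative assumption: it floors |w| so the power
    with a negative exponent (p < 2) never divides by zero.
    """
    w = np.asarray(w, dtype=float)
    grad = np.asarray(grad, dtype=float)
    # s = |w|^(p-2), evaluated at the current weights
    s = np.maximum(np.abs(w), eps) ** (p - 2.0)
    # gradient step in the numerator, decoupled shrinkage in the denominator
    return (w - lr * grad) / (1.0 + lr * lam * s)

# With zero loss gradient, p = 1 shrinks small weights much harder than large ones.
w = np.array([1.0, 0.01])
w_new = pwd_step(w, np.zeros_like(w), lr=0.1, lam=1.0, p=1.0)
print(w_new / w)  # per-weight shrink factors
```

For $p = 2$ the factor $s$ equals 1 and the step reduces to a uniform rescaling by $1/(1 + \alpha \lambda_p)$, which is one way to see the scheme as a generalization of standard $L_2$ weight decay.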
Abstract
With the success of deep neural networks (NNs) in a variety of domains, the computational and storage requirements for training and deploying large NNs have become a bottleneck for further improvements. Sparsification has consequently emerged as a leading approach to tackle these issues. In this work, we consider a simple yet effective approach to sparsification, based on Bridge, or $L_p$, regularization during training. We introduce a novel weight decay scheme, which generalizes the standard $L_2$ weight decay to any $p$ norm. We show that this scheme is compatible with adaptive optimizers, and avoids the gradient divergence associated with $0<p<1$ norms. We empirically demonstrate that it leads to highly sparse networks, while maintaining generalization performance comparable to standard $L_2$ regularization.
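The gradient divergence mentioned here can be made concrete with a short calculation. It is a sketch that uses only the standard derivative of $|w|^p$ and the update formula quoted in the TL;DR, reading $\alpha$ as the learning rate and $\lambda_p$ as the decay strength; the normalization of the penalty is left out because it is not specified in this section.

$$
\frac{\partial}{\partial w}\,|w|^p = p\,|w|^{p-1}\,\mathrm{sign}(w),
\qquad
\left|\frac{\partial}{\partial w}\,|w|^p\right| \to \infty \ \text{ as } |w| \to 0 \ \text{ for } 0<p<1 .
$$

$$
0 < \frac{1}{1 + \alpha \lambda_p |w|^{p-2}} \le 1 \ \text{ for } \alpha\lambda_p \ge 0,
\qquad
\frac{1}{1 + \alpha \lambda_p |w|^{p-2}} \to 0 \ \text{ as } |w| \to 0 \ \text{ for } p<2,\ \alpha\lambda_p > 0 .
$$

Adding the penalty gradient directly to the loss gradient would therefore produce unbounded steps near $w = 0$ when $0<p<1$, whereas in the decoupled form the same power of $|w|$ enters only through a bounded multiplicative factor that pushes small weights toward zero.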
