Hidden Synergy: $L_1$ Weight Normalization and 1-Path-Norm Regularization

Aditya Biswas

TL;DR

This work investigates capacity control in overparameterized neural networks by combining $L_1$ weight normalization with 1-path-norm regularization. The authors introduce PSiLON Net and PSiLON ResNet, which share one length parameter across each layer and use a simplified (PSiLON Net) or improved (PSiLON ResNet) 1-path-norm bound, enabling efficient optimization and a bias toward near-sparse solutions. They also develop a pruning mechanism that achieves exact sparsity without sacrificing performance, and they validate the approach on small tabular datasets and deep ablations, demonstrating competitive generalization and favorable training dynamics. Collectively, the approach offers a practical, computationally cheaper route to controllable, generalizable networks in data-scarce regimes.

Abstract

We present PSiLON Net, an MLP architecture that uses $L_1$ weight normalization for each weight vector and shares the length parameter across the layer. The 1-path-norm provides a bound for the Lipschitz constant of a neural network and reflects on its generalizability, and we show how PSiLON Net's design drastically simplifies the 1-path-norm, while providing an inductive bias towards efficient learning and near-sparse parameters. We propose a pruning method to achieve exact sparsity in the final stages of training, if desired. To exploit the inductive bias of residual networks, we present a simplified residual block, leveraging concatenated ReLU activations. For networks constructed with such blocks, we prove that considering only a subset of possible paths in the 1-path-norm is sufficient to bound the Lipschitz constant. Using the 1-path-norm and this improved bound as regularizers, we conduct experiments in the small data regime using overparameterized PSiLON Nets and PSiLON ResNets, demonstrating reliable optimization and strong performance.
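The simplification mentioned in the abstract can be illustrated numerically. The following is a minimal NumPy sketch under stated assumptions, not the paper's implementation: each row of a weight matrix is taken to be one unit's incoming weight vector, every layer (including the output layer) is $L_1$-normalized with a single shared length, biases are omitted, and the activation is ReLU. The helper names (`l1_weight_norm`, `one_path_norm`, `mlp`) and the layer sizes and lengths are hypothetical. Under these assumptions the 1-path-norm collapses to the output width times the product of the layer lengths, and it upper-bounds the ratio $\|f(x)-f(y)\|_1 / \|x-y\|_\infty$ on sampled input pairs.

```python
import numpy as np

def l1_weight_norm(V, g):
    """Rescale every row of V to L1 norm g (one shared length per layer)."""
    return g * V / np.abs(V).sum(axis=1, keepdims=True)

def one_path_norm(weights):
    """Sum, over all input-to-output paths, of the product of |weights|."""
    v = np.ones(weights[0].shape[1])
    for W in weights:
        v = np.abs(W) @ v
    return v.sum()

def mlp(x, weights):
    """Bias-free ReLU MLP (assumption: biases omitted for simplicity)."""
    for W in weights[:-1]:
        x = np.maximum(W @ x, 0.0)
    return weights[-1] @ x

rng = np.random.default_rng(0)
sizes = [8, 16, 16, 3]      # hypothetical layer widths
lengths = [0.7, 1.3, 0.9]   # hypothetical shared length per layer

weights = [
    l1_weight_norm(rng.normal(size=(n_out, n_in)), g)
    for n_in, n_out, g in zip(sizes[:-1], sizes[1:], lengths)
]

# With every row of layer k having L1 norm g_k, |W_k| @ (c * ones) = c * g_k * ones,
# so the path norm reduces to (output width) * prod(g_k).
path_norm = one_path_norm(weights)
closed_form = sizes[-1] * np.prod(lengths)

# Empirical check of the Lipschitz bound (L_inf on inputs, L_1 on outputs).
max_ratio = 0.0
for _ in range(1000):
    x, y = rng.normal(size=sizes[0]), rng.normal(size=sizes[0])
    num = np.abs(mlp(x, weights) - mlp(y, weights)).sum()
    den = np.abs(x - y).max()
    max_ratio = max(max_ratio, num / den)

print(path_norm, closed_form, max_ratio)
```

In this toy setting the regularizer depends only on the length parameters, which is one way to read the claim that the design "drastically simplifies" the 1-path-norm; the exact form used in the paper may differ.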

Paper Structure (34 sections, 2 theorems, 21 equations, 1 figure, 3 tables, 3 algorithms)

Introduction
Contributions
Notation
$L_1$ Weight Normalization
Inductive Bias
Weight Pruning
1-Path-Norm
Residual CReLU Networks
CReLU Residual Block
Improved Bound
Bias Parameters
Regularization
MLPs
CReLU Residual Networks
Remarks
...and 19 more sections

Key Result

Theorem 1

Suppose the subgradients of $\sigma$ and $\sigma_\text{out}$ are globally bounded between zero and one. Let $\mathcal{L}_\mathbf{W}$ denote the Lipschitz constant of the neural network $f_\mathbf{W}$ with respect to the $L_\infty$ and $L_1$ norms for the input and output spaces, respectively. Then,

Figures (1)

Figure 1: Validation cross-entropy curves on the Higgs dataset under varying regularization levels, indicated by color. Using $L_2$ WR, results are presented for S-Net without WN (top) and S-Net with WN (middle). Using 1-path-norm regularization, results are presented for P-Net (bottom).

Theorems & Definitions (4)

Definition 1: Near-Sparsity
Theorem 1
Definition 2: CReLU Residual Network
Theorem 2
