Hidden Synergy: $L_1$ Weight Normalization and 1-Path-Norm Regularization

Aditya Biswas

TL;DR

This work investigates capacity control in overparameterized neural networks by combining $L_1$ weight normalization with 1-path-norm regularization. The authors introduce PSiLON Net and PSiLON ResNet, which share a length parameter across each layer and leverage a simplified 1-path-norm (PSiLON Net) or an improved 1-path-norm bound (PSiLON ResNet), enabling efficient optimization and a bias toward near-sparse solutions. They also develop a pruning mechanism that achieves exact sparsity without sacrificing performance, and they validate the approach on small tabular datasets and deep ablations, demonstrating competitive generalization and favorable training dynamics. Collectively, the approach offers a practical pathway to controllable, generalizable networks in data-scarce regimes, with reduced computational burden and strong empirical results.

Abstract

We present PSiLON Net, an MLP architecture that uses $L_1$ weight normalization for each weight vector and shares the length parameter across the layer. The 1-path-norm provides a bound for the Lipschitz constant of a neural network and reflects its generalizability, and we show how PSiLON Net's design drastically simplifies the 1-path-norm while providing an inductive bias towards efficient learning and near-sparse parameters. We propose a pruning method to achieve exact sparsity in the final stages of training, if desired. To exploit the inductive bias of residual networks, we present a simplified residual block, leveraging concatenated ReLU activations. For networks constructed with such blocks, we prove that considering only a subset of possible paths in the 1-path-norm is sufficient to bound the Lipschitz constant. Using the 1-path-norm and this improved bound as regularizers, we conduct experiments in the small data regime using overparameterized PSiLON Nets and PSiLON ResNets, demonstrating reliable optimization and strong performance.
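To make the claimed simplification concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a bias-free ReLU MLP, takes the 1-path-norm in its standard form $\mathbf{1}^\top |W_K| \cdots |W_1| \mathbf{1}$, treats each row of a weight matrix as one neuron's weight vector, and normalizes every layer, including the output layer. The function names, layer sizes, and length values are hypothetical. Under these assumptions every row of layer $k$ has $L_1$ norm $|g_k|$, so the 1-path-norm collapses to the number of outputs times the product of the shared lengths.

```python
import numpy as np

def l1_weight_norm(V, g):
    """Rescale each row of V (one neuron's incoming weights) to L1 norm |g|.

    g is a single length shared by every row in the layer (assumed setup).
    """
    return g * V / np.abs(V).sum(axis=1, keepdims=True)

def one_path_norm(weights):
    """Standard 1-path-norm of a bias-free MLP: 1^T |W_K| ... |W_1| 1."""
    v = np.ones(weights[0].shape[1])
    for W in weights:
        v = np.abs(W) @ v
    return v.sum()

def forward(weights, X):
    """Bias-free ReLU MLP applied to a batch X of shape (n, d_in)."""
    for W in weights[:-1]:
        X = np.maximum(X @ W.T, 0.0)
    return X @ weights[-1].T

rng = np.random.default_rng(0)
dims = [5, 16, 16, 3]       # input, two hidden layers, output (illustrative)
lengths = [1.5, 0.8, 2.0]   # one shared length per layer (hypothetical values)

weights = [
    l1_weight_norm(rng.normal(size=(d_out, d_in)), g)
    for d_in, d_out, g in zip(dims[:-1], dims[1:], lengths)
]

# Full path-norm computation versus the closed form implied by shared lengths.
full = one_path_norm(weights)
simplified = dims[-1] * np.prod(np.abs(lengths))
print(full, simplified)  # both equal 7.2 up to floating-point error

# Empirical check of the Lipschitz bound: L1 norm on outputs, L-infinity on inputs.
x, y = rng.normal(size=(2, 200, dims[0]))
out_diff = np.abs(forward(weights, x) - forward(weights, y)).sum(axis=1)
in_diff = np.abs(x - y).max(axis=1)
ratios = out_diff / in_diff
print(ratios.max() <= full)
```

In this sketch the regularizer depends only on the per-layer lengths, so evaluating it costs one multiplication per layer instead of a chain of matrix products. If the output layer were left unnormalized, the last factor would instead be the sum of the absolute values of its entries.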

Paper Structure

This paper contains 34 sections, 2 theorems, 21 equations, 1 figure, 3 tables, 3 algorithms.

Key Result

Theorem 1

Suppose the subgradients of $\sigma$ and $\sigma_\text{out}$ are globally bounded between zero and one. Let $\mathcal{L}_\mathbf{W}$ denote the Lipschitz constant of the neural network $f_\mathbf{W}$ with respect to the $L_\infty$ and $L_1$ norms for the input and output spaces, respectively. Then,

Figures (1)

  • Figure 1: Validation cross-entropy curves on the Higgs dataset under varying regularization levels, indicated by color. Using $L_2$ WR, results are presented for S-Net without WN (top) and S-Net with WN (middle). Using 1-path-norm regularization, results are presented for P-Net (bottom).

Theorems & Definitions (4)

  • Definition 1: Near-Sparsity
  • Theorem 1
  • Definition 2: CReLU Residual Network
  • Theorem 2