Deep Weight Factorization: Sparse Learning Through the Lens of Artificial Symmetries

Chris Kolb, Tobias Weber, Bernd Bischl, David Rügamer

TL;DR

This work tackles the challenge of inducing sparsity in neural networks with SGD by turning non-differentiable $L_1$ regularization into a differentiable framework through deep weight factorization (DWF). It theoretically proves an equivalence between a deep factorized objective and a non-convex $L_{2/D}$ regularization on the collapsed weights for depth $D\geq 2$, enabling standard SGD to find sparse solutions via $L_2$ penalties on the factor matrices. The authors provide a tailored initialization (VarMatch with interval truncation) and learning-rate strategies to stabilize training, and they reveal three distinct learning phases and delayed generalization tied to regularization and depth. Empirically, DWF consistently outperforms shallow factorization and many pruning baselines across diverse architectures and datasets, delivering high compression with minimal runtime overhead, and exhibiting adaptive layer-wise sparsity budgets that avoid catastrophic layer pruning. These results imply a practical and theoretically grounded route to efficient, sparse deep models without post-hoc pruning.

Abstract

Sparse regularization techniques are well-established in machine learning, yet their application in neural networks remains challenging due to the non-differentiability of penalties like the $L_1$ norm, which is incompatible with stochastic gradient descent. A promising alternative is shallow weight factorization, where weights are decomposed into two factors, allowing for smooth optimization of $L_1$-penalized neural networks by adding differentiable $L_2$ regularization to the factors. In this work, we introduce deep weight factorization, extending previous shallow approaches to more than two factors. We theoretically establish equivalence of our deep factorization with non-convex sparse regularization and analyze its impact on training dynamics and optimization. Due to the limitations posed by standard training practices, we propose a tailored initialization scheme and identify important learning rate requirements necessary for training factorized networks. We demonstrate the effectiveness of our deep weight factorization through experiments on various architectures and datasets, consistently outperforming its shallow counterpart and widely used pruning methods.
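The factorization idea described in the abstract can be illustrated on a toy problem. The sketch below is not the authors' implementation. It assumes that the collapsed weight is the elementwise product of $D$ factor vectors, that the penalty is $(\lambda/D)\sum_d \|\bm{\omega}_d\|_2^2$, and that the data loss is a separable quadratic $\tfrac{1}{2}\sum_j (w_j - t_j)^2$ with hypothetical targets $t_j$. It also uses a plain all-ones initialization and full-batch gradient descent rather than the tailored initialization and SGD setup the paper proposes. All function and parameter names are illustrative.

```python
import math


def collapse(factors):
    """Elementwise product of the D factor vectors (assumed collapse rule)."""
    p = len(factors[0])
    return [math.prod(f[j] for f in factors) for j in range(p)]


def train_factorized(targets, depth=3, lam=0.1, lr=0.05, steps=5000):
    """Gradient descent on the factors with a smooth L2 penalty.

    Toy objective (an assumption, chosen for illustration only):
        0.5 * sum_j (w_j - t_j)^2 + (lam / depth) * sum_d ||omega_d||^2
    with w = omega_1 * ... * omega_depth taken elementwise.
    """
    p = len(targets)
    # All-ones start: NOT the paper's initialization, just a simple choice.
    factors = [[1.0] * p for _ in range(depth)]
    for _ in range(steps):
        w = collapse(factors)
        grads = []
        for d in range(depth):
            g_d = []
            for j in range(p):
                # Chain rule: d w_j / d omega_{d,j} is the product of the other factors.
                others = math.prod(factors[k][j] for k in range(depth) if k != d)
                g_d.append((w[j] - targets[j]) * others
                           + 2.0 * lam / depth * factors[d][j])
            grads.append(g_d)
        factors = [[f - lr * g for f, g in zip(fd, gd)]
                   for fd, gd in zip(factors, grads)]
    return collapse(factors)


# Two large targets and two targets at or near zero.
weights = train_factorized([2.0, 0.05, 1.5, 0.0])
print([round(x, 4) for x in weights])
```

In this toy setting the coordinates with large targets keep a slightly shrunken weight, while the coordinates with small or zero targets are driven toward zero by the smooth penalty alone, with no thresholding step.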

Paper Structure

This paper contains 66 sections, 6 theorems, 40 equations, 27 figures, 4 tables, 2 algorithms.

Key Result

Lemma 1

Let $\bm{\omega} = (\bm{\omega}_1, \ldots, \bm{\omega}_D) \in \mathbb{R}^{Dp}$ be a local minimizer of $\mathcal{L}_{\bm{\omega},\lambda}(\bm{\omega})$. Then i) $|\omega_{j,1}| = \ldots = |\omega_{j,D}|\,$ for all $j \in [p]$, and ii) the factor $L_2$ penalty reduces to $D^{-1} \sum_{d=1}^D \|\bm{\o
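Part i) of the lemma can be checked numerically with the arithmetic-geometric mean inequality. The sketch below assumes that the collapsed weight is the elementwise product of the factors and uses the $D^{-1}$ normalization visible in the lemma statement; the helper names are hypothetical. It rescales the factors of each coordinate to equal magnitude, which leaves the collapsed weights (and hence any loss that depends only on them) unchanged, and compares the averaged factor penalty with $\sum_j |w_j|^{2/D}$.

```python
import math


def collapsed(factors):
    """Elementwise product of the factor vectors (assumed collapse rule)."""
    p = len(factors[0])
    return [math.prod(f[j] for f in factors) for j in range(p)]


def factor_penalty(factors):
    """(1/D) * sum_d ||omega_d||_2^2."""
    depth = len(factors)
    return sum(x * x for f in factors for x in f) / depth


def quasi_norm_penalty(factors):
    """sum_j |w_j|^(2/D) for the collapsed weights w."""
    depth = len(factors)
    return sum(abs(w_j) ** (2.0 / depth) for w_j in collapsed(factors))


def balance(factors):
    """Rescale each coordinate's factors to equal magnitude, keeping the product."""
    depth = len(factors)
    out = [list(f) for f in factors]
    for j, w_j in enumerate(collapsed(factors)):
        mag = abs(w_j) ** (1.0 / depth)
        for d in range(depth):
            out[d][j] = mag
        if w_j < 0:
            out[0][j] = -mag  # the first factor carries the sign of w_j
    return out


# Unbalanced factors for D = 3, p = 2; collapsed weights are 1.0 and -2.0.
unbalanced = [[2.0, -0.5], [0.5, 4.0], [1.0, 1.0]]
balanced = balance(unbalanced)

print(factor_penalty(unbalanced), quasi_norm_penalty(unbalanced))
print(factor_penalty(balanced), quasi_norm_penalty(balanced))
```

The unbalanced factors pay a strictly larger penalty than the balanced ones for the same collapsed weights, so an unbalanced point can always be improved by rescaling and cannot be a local minimizer. At the balanced point the two penalties coincide.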

Figures (27)

  • Figure 1: Sparsity-accuracy tradeoff using a vanilla $L_1$ penalization with SGD (blue) compared to (deep) weight factorization. Means and std. deviations over 3 random seeds are shown.
  • Figure 2: Overview of the proposed method (cf. the training algorithm). Our approach proceeds by factorizing the neural network weights and running SGD on the factors $\bm{\omega}_d$ with weight decay. Post-training, the factors are collapsed again, with the resulting sparse solutions being minimizers of the non-smooth $L_{2/D}$-regularized objective.
  • Figure 3: Scalar rescaling symmetry and min-norm factorizations.
  • Figure 4: DWF initialization strategies. Left: factor densities with variance matching and truncation. Middle: product densities for $D=4$ illustrating kurtosis explosion without truncation. Right: sparsity-accuracy curves for different initializations and $D$, showing the failure of standard initialization.
  • Figure 5: Failure modes when optimizing factorized neural networks.
  • ...and 22 more figures

Theorems & Definitions (17)

  • Definition 1: Rescaling Symmetry
  • Definition 2: Deep Weight Factorization
  • Lemma 1: Necessary condition for solution and minimum $L_2$ penalty
  • Theorem 1: Equivalence of optimization problems
  • Lemma 2: Standard initializations in factorized networks
  • Remark 1
  • proof
  • Definition 3: Standard Weight Initialization
  • proof
  • proof
  • ...and 7 more