Mask in the Mirror: Implicit Sparsification

Tom Jacobs, Rebekka Burkholz

TL;DR

This work tackles the high cost of large neural networks by analyzing continuous sparsification using the $m \odot w$ parameterization, revealing that training dynamics induce an implicit $L_2$ bias that transitions to an $L_1$ bias over time. It extends the mirror-flow framework to a time-dependent Bregman potential $R_{a_t}$, enabling explicit control over this implicit bias through a dynamic regularization schedule $\alpha_t$ and a sign-flip-friendly initialization. The authors prove convergence and, in underdetermined linear regression, optimality results under the time-dependent setting, and instantiate PILoT, a practical algorithm that dynamically tunes regularization to achieve superior sparsity-accuracy trade-offs. Empirically, PILoT outperforms state-of-the-art continuous sparsification baselines on CIFAR and ImageNet, and shows optimality in diagonal linear networks, supporting the proposed theoretical framework. Overall, the work provides a principled route to controllable implicit regularization in sparsification with meaningful implications for scalable neural networks.

Abstract

Continuous sparsification strategies are among the most effective methods for reducing the inference costs and memory demands of large-scale neural networks. A key factor in their success is the implicit $L_1$ regularization induced by jointly learning both mask and weight variables, which has been shown experimentally to outperform explicit $L_1$ regularization. We provide a theoretical explanation for this observation by analyzing the learning dynamics, revealing that early continuous sparsification is governed by an implicit $L_2$ regularization that gradually transitions to an $L_1$ penalty over time. Leveraging this insight, we propose a method to dynamically control the strength of this implicit bias. Through an extension of the mirror flow framework, we establish convergence and optimality guarantees in the context of underdetermined linear regression. Our theoretical findings may be of independent interest, as we demonstrate how to enter the rich regime and show that the implicit bias can be controlled via a time-dependent Bregman potential. To validate these insights, we introduce PILoT, a continuous sparsification approach with novel initialization and dynamic regularization, which consistently outperforms baselines in standard experiments.
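The implicit $L_1$ effect of jointly training mask and weight variables can be illustrated on a toy underdetermined regression with a single equation $x_1 + 2x_2 = 2$. The sketch below is not PILoT: it uses a constant regularization strength, plain gradient descent, and it assumes the regularizer has the form $\frac{\alpha}{2}(\|m\|^2 + \|w\|^2)$. The function name `fit_masked`, the step size, and the initialization $m_0 = 1$, $w_0 = 0$ are illustrative choices, not taken from the paper.

```python
def fit_masked(a, y, alpha=0.1, lr=0.01, steps=20000):
    """Gradient descent on 0.5*(a.x - y)^2 + (alpha/2)*(|m|^2 + |w|^2), with x = m*w.

    Hypothetical helper for illustration; the regularizer form is an assumption.
    """
    n = len(a)
    m = [1.0] * n  # mask variables, chosen so that |w0| < m0
    w = [0.0] * n  # weight variables
    for _ in range(steps):
        x = [mi * wi for mi, wi in zip(m, w)]
        r = sum(ai * xi for ai, xi in zip(a, x)) - y  # scalar residual
        g = [r * ai for ai in a]                      # gradient of the loss w.r.t. x
        # Chain rule through x = m * w, plus weight decay on both factors.
        grad_m = [gi * wi + alpha * mi for gi, wi, mi in zip(g, w, m)]
        grad_w = [gi * mi + alpha * wi for gi, wi, mi in zip(g, w, m)]
        m = [mi - lr * gm for mi, gm in zip(m, grad_m)]
        w = [wi - lr * gw for wi, gw in zip(w, grad_w)]
    return [mi * wi for mi, wi in zip(m, w)]


# One equation, two unknowns: x1 + 2*x2 = 2.
x = fit_masked([1.0, 2.0], 2.0)
```

For this toy problem the minimum $L_2$-norm interpolator is $(0.4, 0.8)$, whereas the stationary point reached here is the sparse lasso-type solution $(0, 1 - \alpha/4)$, which places all mass on the second coordinate.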

Paper Structure (18 sections, 14 theorems, 52 equations, 6 figures, 6 tables, 1 algorithm)

Key Result

Theorem 2.1

Let $|w_{0,i}| < m_{0,i}$ for all $i \in [n]$. The time-dependent Bregman potential $R_{a_t}$ is given by [equation not recovered] with $a_{t,i} = 2u_{0,i} v_{0,i} \exp\left(- 2\int_0^t \alpha_s \, ds\right)$, $u_{0,i} = \frac{m_{0,i} + w_{0,i}}{\sqrt{2}}$, and $v_{0,i} = \frac{m_{0,i} - w_{0,i}}{\sqrt{2}}$. The gradient flow of $x_t = m_t \odot w_t$ induced by Eq. (PILoT: opt problem) then satisfies [equation not recovered].
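The scale parameter $a_{t,i}$ from Theorem 2.1 can be evaluated numerically for a given schedule. The following sketch computes it for a single coordinate; the helper names `a_t` and `integrate_schedule`, the trapezoidal integration, and the linear schedule $\alpha_s = 0.1\,s$ are assumptions made for illustration and do not come from the paper.

```python
import math


def a_t(m0, w0, alpha_integral):
    """a_{t,i} = 2 * u0 * v0 * exp(-2 * int_0^t alpha_s ds) for one coordinate."""
    if abs(w0) >= m0:
        raise ValueError("Theorem 2.1 assumes |w0| < m0")
    u0 = (m0 + w0) / math.sqrt(2)
    v0 = (m0 - w0) / math.sqrt(2)
    return 2 * u0 * v0 * math.exp(-2 * alpha_integral)


def integrate_schedule(alpha, t, steps=1000):
    """Trapezoidal estimate of int_0^t alpha(s) ds for any callable alpha."""
    h = t / steps
    total = 0.5 * (alpha(0.0) + alpha(t))
    total += sum(alpha(k * h) for k in range(1, steps))
    return total * h


def linear_schedule(s):
    """Hypothetical regularization schedule, alpha_s = 0.1 * s."""
    return 0.1 * s


# a_t at t = 0, 5, 10 for the initialization m0 = 1.0, w0 = 0.5.
values = [a_t(1.0, 0.5, integrate_schedule(linear_schedule, t)) for t in (0.0, 5.0, 10.0)]
```

At $t = 0$ the value equals $2u_0 v_0 = m_0^2 - w_0^2$, and it shrinks exponentially in the accumulated regularization $\int_0^t \alpha_s \, ds$, which is the quantity that a dynamic schedule controls.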

Figures (6)

  • Figure 1: Evolution of the time-dependent Bregman potential. $\alpha = \int_0^t \alpha_s ds$ is the exponent of $a_t$.
  • Figure 2: A simulation of gradient flow on a diagonal linear network is given for the different regularizations.
  • Figure 3: One-shot sparsification. Accuracy versus sparsity for CIFAR10 (left) and CIFAR100 (right).
  • Figure 4: Learning Rate Rewinding (LRR) and Weight Rewinding (WR) with PILoT on ImageNet ResNet-18. The left plot is the complete plot and the right plot is a zoomed-in version.
  • Figure 5: All runs for the diagonal linear network. From left to right: $m \odot w$ with PILoT initialization, $m \odot w$ with spred initialization, and $x$ with $L_1$ regularization.
  • ...and 1 more figure

Theorems & Definitions (18)

  • Theorem 2.1
  • Remark 2.1
  • Theorem 2.2
  • Remark 2.2
  • Theorem 2.3
  • Remark 3.1
  • Theorem A.1
  • Remark A.1
  • Theorem A.2
  • Theorem A.3
  • ...and 8 more