Mask in the Mirror: Implicit Sparsification

Tom Jacobs; Rebekka Burkholz

TL;DR

This work tackles the high cost of large neural networks by analyzing continuous sparsification using the $m \odot w$ parameterization, revealing that training dynamics induce an implicit $L_2$ bias that transitions to an $L_1$ bias over time. It extends the mirror-flow framework to a time-dependent Bregman potential $R_{a_t}$, enabling explicit control over this implicit bias through a dynamic regularization schedule $\alpha_t$ and a sign-flip-friendly initialization. The authors prove convergence and, in underdetermined linear regression, optimality results under the time-dependent setting, and instantiate PILoT, a practical algorithm that dynamically tunes regularization to achieve superior sparsity-accuracy trade-offs. Empirically, PILoT outperforms state-of-the-art continuous sparsification baselines on CIFAR and ImageNet, and shows optimality in diagonal linear networks, supporting the proposed theoretical framework. Overall, the work provides a principled route to controllable implicit regularization in sparsification with meaningful implications for scalable neural networks.

Abstract

Continuous sparsification strategies are among the most effective methods for reducing the inference costs and memory demands of large-scale neural networks. A key factor in their success is the implicit $L_1$ regularization induced by jointly learning both mask and weight variables, which has been shown experimentally to outperform explicit $L_1$ regularization. We provide a theoretical explanation for this observation by analyzing the learning dynamics, revealing that early continuous sparsification is governed by an implicit $L_2$ regularization that gradually transitions to an $L_1$ penalty over time. Leveraging this insight, we propose a method to dynamically control the strength of this implicit bias. Through an extension of the mirror flow framework, we establish convergence and optimality guarantees in the context of underdetermined linear regression. Our theoretical findings may be of independent interest, as we demonstrate how to enter the rich regime and show that the implicit bias can be controlled via a time-dependent Bregman potential. To validate these insights, we introduce PILoT, a continuous sparsification approach with novel initialization and dynamic regularization, which consistently outperforms baselines in standard experiments.
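One standard way to see why jointly learning a mask $m$ and a weight $w$ can act like an $L_1$ penalty on $x = m \odot w$ is the scalar identity $\min_{m w = x} \tfrac{1}{2}(m^2 + w^2) = |x|$, which follows from the AM-GM inequality. The sketch below checks this numerically by brute force. It illustrates that identity only and is not the paper's analysis or the PILoT algorithm; the function names and the grid are assumptions made for this example.

```python
import numpy as np

def l2_penalty_on_factors(m, w):
    """L2 penalty (m^2 + w^2) / 2 on the two factors of x = m * w."""
    return 0.5 * (m**2 + w**2)

def min_penalty_over_factorizations(x, num_grid=200_001):
    """Brute-force minimum of the factor penalty over all m > 0 with w = x / m.

    The grid over m is a hypothetical choice that comfortably contains
    the minimizer m = sqrt(|x|) for the values of x used below.
    """
    m = np.linspace(1e-3, 5.0, num_grid)
    w = x / m  # enforce the constraint m * w = x
    return l2_penalty_on_factors(m, w).min()

if __name__ == "__main__":
    for x in (-2.0, -0.5, 0.3, 1.7):
        best = min_penalty_over_factorizations(x)
        print(f"x = {x:+.1f}  min penalty = {best:.6f}  |x| = {abs(x):.6f}")
```

The minimum is attained at the balanced factorization $|m| = |w| = \sqrt{|x|}$. This describes the minimizer only; it does not say how quickly training reaches it.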

Paper Structure (18 sections, 14 theorems, 52 equations, 6 figures, 6 tables, 1 algorithm)


Introduction
Related work
Controlling the implicit bias with explicit regularization
The algorithm: PILoT
Experiments
Diagonal Linear Network.
One-shot sparsification.
Iterative Pruning.
Discussion
Mirror flow framework
Proof main result
Discussion of the proof
Details experiments
Diagonal linear network
One-shot
...and 3 more sections

Key Result

Theorem 2.1

Let $|w_{0,i}| < m_{0,i}$ for all $i \in [n]$. The time-dependent Bregman potential $R_{a_t}$ is parameterized by $a_{t,i} = 2u_{0,i} v_{0,i} \exp\left(- 2\int_0^t \alpha_s \, ds\right)$, where $u_{0,i} = \frac{m_{0,i} + w_{0,i}}{\sqrt{2}}$ and $v_{0,i} = \frac{m_{0,i} - w_{0,i}}{\sqrt{2}}$. The gradient flow of $x_t = m_t \odot w_t$ induced by Eq. (PILoT: opt problem) then satisfies
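The scale $a_{t,i}$ in Theorem 2.1 can be evaluated directly from the initialization and the regularization schedule. Since $2 u_{0,i} v_{0,i} = m_{0,i}^2 - w_{0,i}^2$, the scale starts at $m_{0,i}^2 - w_{0,i}^2$ and shrinks exponentially in the accumulated regularization $\int_0^t \alpha_s \, ds$. The following is a minimal sketch of that computation. The function name `bregman_scale`, the trapezoidal quadrature, and the constant schedule in the usage example are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def bregman_scale(m0, w0, alpha, t_grid):
    """Evaluate a_{t,i} = 2 u_{0,i} v_{0,i} exp(-2 * int_0^t alpha_s ds).

    m0, w0 : initial mask and weight vectors with |w0_i| < m0_i.
    alpha  : callable s -> alpha_s (hypothetical schedule interface).
    t_grid : increasing 1-D array of times starting at 0.
    Returns an array of shape (len(t_grid), len(m0)).
    """
    m0 = np.asarray(m0, dtype=float)
    w0 = np.asarray(w0, dtype=float)
    t_grid = np.asarray(t_grid, dtype=float)
    if np.any(np.abs(w0) >= m0):
        raise ValueError("Theorem 2.1 assumes |w0_i| < m0_i for all i")

    u0 = (m0 + w0) / np.sqrt(2.0)
    v0 = (m0 - w0) / np.sqrt(2.0)

    # Cumulative trapezoidal approximation of int_0^t alpha_s ds.
    alpha_vals = np.array([alpha(s) for s in t_grid], dtype=float)
    steps = 0.5 * (alpha_vals[1:] + alpha_vals[:-1]) * np.diff(t_grid)
    integral = np.concatenate([[0.0], np.cumsum(steps)])

    return (2.0 * u0 * v0)[None, :] * np.exp(-2.0 * integral)[:, None]

if __name__ == "__main__":
    t = np.linspace(0.0, 2.0, 201)
    # Constant schedule alpha_s = 0.5, chosen only for illustration.
    a = bregman_scale([1.0, 1.0], [0.0, 0.5], lambda s: 0.5, t)
    print("a at t=0:", a[0])
    print("a at t=2:", a[-1])
```

A larger accumulated $\int_0^t \alpha_s \, ds$ drives $a_t$ toward zero faster, which is the knob the dynamic schedule $\alpha_t$ provides.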

Figures (6)

Figure 1: Evolution of the time-dependent Bregman potential. $\alpha = \int_0^t \alpha_s ds$ is the exponent of $a_t$.
Figure 2: A simulation of gradient flow on a diagonal linear network is given for the different regularizations.
Figure 3: One-shot sparsification. Accuracy versus sparsity for CIFAR10 (left) and CIFAR100 (right).
Figure 4: Learning Rate Rewinding (LRR) and Weight Rewinding (WR) with PILoT on ImageNet ResNet-18. The left plot is the complete plot and the right plot is a zoomed-in version.
Figure 5: All runs for the diagonal linear network. From left to right: $m \odot w$ with PILoT initialization, $m \odot w$ with spred initialization, and $x$ with $L_1$ regularization.
...and 1 more figures

Theorems & Definitions (18)

Theorem 2.1
Remark 2.1
Theorem 2.2
Remark 2.2
Theorem 2.3
Remark 3.1
Theorem A.1
Remark A.1
Theorem A.2
Theorem A.3
...and 8 more
