Mask in the Mirror: Implicit Sparsification
Tom Jacobs, Rebekka Burkholz
TL;DR
This work addresses the high cost of large neural networks by analyzing continuous sparsification with the $m \odot w$ parameterization. It shows that the training dynamics induce an implicit bias that starts as $L_2$ regularization and transitions to $L_1$ regularization over time. The authors extend the mirror-flow framework to a time-dependent Bregman potential $R_{a_t}$, which makes this implicit bias explicitly controllable through a dynamic regularization schedule $\alpha_t$ and an initialization that permits sign flips. They prove convergence in the time-dependent setting and, for underdetermined linear regression, optimality. These results are instantiated in PILoT, a practical algorithm that tunes the regularization dynamically to improve the sparsity-accuracy trade-off. Empirically, PILoT outperforms state-of-the-art continuous sparsification baselines on CIFAR and ImageNet and shows optimality in diagonal linear networks, consistent with the theoretical framework. Overall, the work offers a principled route to controllable implicit regularization when sparsifying large neural networks.
Abstract
Continuous sparsification strategies are among the most effective methods for reducing the inference costs and memory demands of large-scale neural networks. A key factor in their success is the implicit $L_1$ regularization induced by jointly learning both mask and weight variables, which has been shown experimentally to outperform explicit $L_1$ regularization. We provide a theoretical explanation for this observation by analyzing the learning dynamics, revealing that early continuous sparsification is governed by an implicit $L_2$ regularization that gradually transitions to an $L_1$ penalty over time. Leveraging this insight, we propose a method to dynamically control the strength of this implicit bias. Through an extension of the mirror flow framework, we establish convergence and optimality guarantees in the context of underdetermined linear regression. Our theoretical findings may be of independent interest, as we demonstrate how to enter the rich regime and show that the implicit bias can be controlled via a time-dependent Bregman potential. To validate these insights, we introduce PILoT, a continuous sparsification approach with novel initialization and dynamic regularization, which consistently outperforms baselines in standard experiments.
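
The implicit bias of the $m \odot w$ parameterization in underdetermined linear regression can be illustrated with a small numerical sketch. The code below is not PILoT and uses no dynamic regularization schedule. It runs plain gradient descent on a squared loss over $x = m \odot w$ and compares the result with the minimum-$L_2$-norm interpolator. The initialization (`m` set to a small constant, `w` set to zero), the problem sizes, the learning rate, and the function name `fit_mask_times_weight` are all assumptions chosen for illustration, not values from the paper.

```python
import numpy as np

def fit_mask_times_weight(X, y, init_scale=1e-3, lr=1e-2, steps=20000):
    """Gradient descent on 0.5 * mean((X @ (m * w) - y)**2) over m and w.

    Illustrative assumption: m starts at a small constant and w at zero,
    so x = m * w starts at the origin and each entry can grow with either sign.
    """
    n, d = X.shape
    m = np.full(d, init_scale)
    w = np.zeros(d)
    for _ in range(steps):
        x = m * w
        grad_x = X.T @ (X @ x - y) / n  # gradient of the loss w.r.t. x
        # Chain rule: dL/dm = grad_x * w and dL/dw = grad_x * m (simultaneous update).
        m, w = m - lr * grad_x * w, w - lr * grad_x * m
    return m * w

# Hypothetical underdetermined problem: 30 samples, 60 features, 3 nonzeros.
rng = np.random.default_rng(0)
n, d = 30, 60
X = rng.standard_normal((n, d))
x_true = np.zeros(d)
x_true[:3] = [1.0, -1.0, 1.0]
y = X @ x_true

x_mw = fit_mask_times_weight(X, y)  # solution found through m * w
x_l2 = np.linalg.pinv(X) @ y        # minimum-L2-norm interpolator

print("train MSE (m*w):  ", np.mean((X @ x_mw - y) ** 2))
print("L1 norm (m*w):    ", np.abs(x_mw).sum())
print("L1 norm (min-L2): ", np.abs(x_l2).sum())
```

With a small initialization scale, the solution reached through $m \odot w$ is expected to have a noticeably smaller $L_1$ norm than the minimum-$L_2$-norm interpolator, which is the kind of sparsity-favoring implicit bias the abstract describes. Controlling how that bias evolves during training is what the time-dependent analysis in the paper addresses, and it is not captured by this fixed-setting sketch.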
