Ten Lessons We Have Learned in the New "Sparseland": A Short Handbook for Sparse Neural Network Researchers

Shiwei Liu; Zhangyang Wang

TL;DR

The paper surveys common confusions in sparse neural networks and clarifies distinctions across unstructured/structured sparsity, dense-to-sparse versus sparse-to-sparse training, and static versus dynamic masks. It catalogs ten key questions and provides precise guidance on interpretation, hardware relevance, and fair evaluation practices. By framing SNNs as a broad, emerging field, the work aims to help newcomers and established researchers communicate clearly, design better experiments, and compare methods more reliably. Overall, it emphasizes the practical and theoretical value of sparsity beyond traditional pruning and highlights pathways for scalable, hardware-aware research in sparse models.

Abstract

This article does not propose any novel algorithm or new hardware for sparsity. Instead, it aims to serve the "common good" for the increasingly prosperous Sparse Neural Network (SNN) research community. We attempt to summarize some of the most common confusions in SNNs that one may come across in various scenarios such as paper review/rebuttal and talks - many drawn from the authors' own bittersweet experiences! We feel that doing so is meaningful and timely, since the focus of SNN research is notably shifting from traditional pruning to more diverse and profound forms of sparsity before, during, and after training. The intricate relationships between their scopes, assumptions, and approaches lead to misunderstandings among non-experts and even experts in SNNs. In response, we summarize ten Q&As on SNNs covering many key aspects, including dense vs. sparse, unstructured sparse vs. structured sparse, pruning vs. sparse training, dense-to-sparse training vs. sparse-to-sparse training, static sparsity vs. dynamic sparsity, before-training/during-training vs. post-training sparsity, and many more. We strive to provide proper and generically applicable answers to clarify those confusions to the best extent possible. We hope our summary provides useful general knowledge for people who want to enter and engage with this exciting community, and also provides some "mind of ease" convenience for SNN researchers to explain their work in the right contexts. At the very least (and perhaps as this article's most insignificant target functionality), if you are writing or planning to write a paper or rebuttal in the field of SNNs, we hope some of our answers can help you!

Paper Structure (14 sections, 1 figure)

Background on Sparsity in Neural Networks
Overview of Sparse Neural Networks as Emerging Research Field
Common Confusions in SNNs: Start from Ten Q&As
Why bother with sparse neural networks, not just dense compact networks? (Commonly appearing in reviewer comments like: "Why not just use a smaller dense model, rather than (like an idiot) first starting from a bigger model and then sparsifying it?")
What is the difference between unstructured and structured weight pruning? What is the difference between weight pruning and activation pruning?
Is structured pruning just channel pruning? Is channel pruning the only "practical meaningful" sparse structure on hardware?
Does weight pruning simply produce a smaller/narrower network? (Commonly appearing in reviewer comments like: "It makes no difference to specifically study a sparse network, because that is just a normal dense network in reduced width!")
Why study unstructured sparsity if it cannot be accelerated on common GPUs? (Commonly appearing in reviewer comments like: "...the method is only demonstrated on unstructured sparsity and hence has no practical value!")
Are unstructured and structured sparse algorithms connected, or potentially "convertible" to each other?
What is the difference between dense-to-sparse training and sparse-to-sparse training? When should one choose the former, and when the latter?
Now for sparse-to-sparse training: Is the sparse mask only static or only dynamic? If newly activated weights of dynamic sparse training are initialized with zero, the gradient will also be zero. Why does dynamic sparse training work? Is Dropout a DST method or not?
Is Lottery Ticket Hypothesis also a sparse-to-sparse training approach, and should PaI methods be asked to compare against it?
How to draw fair and trusted comparisons among different sparse algorithms?
Acknowledgement

Figures (1)

Figure 1: Neuron Pruning vs. Weight Pruning on Fully-Connected Networks. Green indicates pruned elements. Neuron pruning decreases the layer width, while weight pruning yields sparser connections but leaves the width unchanged.
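
The distinction drawn in Figure 1 can be sketched on a single fully-connected layer. The following is a minimal illustration, not code from the paper: the layer size (4 inputs, 6 hidden neurons), the choice of which neurons to keep, and the magnitude-based criterion for weight pruning are all assumptions made for the example.

```python
import random

random.seed(0)

# Hypothetical layer: 4 inputs -> 6 hidden neurons (sizes are assumptions).
n_in, n_hidden = 4, 6
W = [[random.gauss(0.0, 1.0) for _ in range(n_hidden)] for _ in range(n_in)]


def shape(matrix):
    """Return (rows, columns) of a list-of-lists matrix."""
    return (len(matrix), len(matrix[0]))


def count_zeros(matrix):
    """Count entries that are exactly zero."""
    return sum(1 for row in matrix for w in row if w == 0.0)


# Neuron pruning: remove entire hidden neurons (whole columns of W).
# The result is a smaller dense matrix, i.e. a narrower layer.
kept_neurons = [0, 2, 5]  # arbitrary choice, for illustration only
W_neuron = [[row[j] for j in kept_neurons] for row in W]

# Weight pruning: zero individual connections and keep the layer width.
# Here the smaller-magnitude half of the weights is zeroed (an assumed criterion).
magnitudes = sorted(abs(w) for row in W for w in row)
threshold = magnitudes[len(magnitudes) // 2]
W_weight = [[w if abs(w) >= threshold else 0.0 for w in row] for row in W]

print("dense shape:         ", shape(W))
print("after neuron pruning:", shape(W_neuron))
print("after weight pruning:", shape(W_weight), "zeros:", count_zeros(W_weight))
```

Neuron pruning changes the matrix shape, so the pruned layer is just a narrower dense layer. Weight pruning keeps the shape and every neuron, and only the connectivity pattern becomes sparse.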
