Accelerated Sparse Neural Training: A Provable and Efficient Method to Find N:M Transposable Masks

Itay Hubara, Brian Chmiel, Moshe Island, Ron Banner, Seffi Naor, Daniel Soudry

TL;DR

Deep neural networks suffer from high training and deployment costs. The authors propose mask diversity to rank sparsity patterns and a transposable N:M mask that accelerates both forward and backward passes, with mask discovery framed as a min-cost-flow problem and a fast 2-approximation for dynamic training. They demonstrate about 2x speedups in matrix multiplications on vision and language models without accuracy degradation and introduce AdaPrune to convert unstructured sparsity to N:M with minimal retraining. The work offers practical pathways toward hardware-friendly sparse training and flexible cross-device deployment.

Abstract

Unstructured pruning reduces the memory footprint of deep neural networks (DNNs). Recently, researchers proposed different types of structural pruning that also aim to reduce computational complexity. In this work, we first suggest a new measure called mask-diversity, which correlates with the expected accuracy of the different types of structural pruning. We focus on the recently suggested N:M fine-grained block sparsity mask, in which each block of M weights has at least N zeros. While N:M fine-grained block sparsity allows acceleration on modern hardware, it can be used only to accelerate the inference phase. To allow similar accelerations in the training phase, we suggest a novel transposable fine-grained sparsity mask, where the same mask can be used for both forward and backward passes. Our transposable mask guarantees that both the weight matrix and its transpose follow the same sparsity pattern; thus, the matrix multiplication required for passing the error backward can also be accelerated. We formulate the problem of finding the optimal transposable mask as a minimum-cost flow problem. Additionally, to speed up the minimum-cost flow computation, we introduce a fast linear-time approximation that can be used when the masks change dynamically during training. Our experiments suggest a 2x speed-up in the matrix multiplications with no accuracy degradation over vision and language models. Finally, to solve the problem of switching between different structure constraints, we suggest a method to convert a pre-trained model with unstructured sparsity to an N:M fine-grained block sparsity model with little to no training. A reference implementation can be found at https://github.com/papers-submission/structured_transposable_masks.
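To make the two mask properties in the abstract concrete, here is a minimal Python sketch that checks them on a small 0/1 mask. It assumes blocks are groups of M consecutive entries along each row, and that a transposable mask is one where the transposed mask passes the same check. The function names and the two example masks are illustrative and are not taken from the reference implementation.

```python
from typing import List

Mask = List[List[int]]  # 1 = weight kept, 0 = weight pruned


def is_n_m_sparse(mask: Mask, n: int, m: int) -> bool:
    """True if every group of m consecutive entries in each row has at least n zeros."""
    for row in mask:
        if len(row) % m != 0:
            return False  # assumption: row length must be a multiple of m
        for start in range(0, len(row), m):
            if row[start:start + m].count(0) < n:
                return False
    return True


def transpose(mask: Mask) -> Mask:
    return [list(col) for col in zip(*mask)]


def is_transposable_n_m(mask: Mask, n: int, m: int) -> bool:
    """True if both the mask and its transpose satisfy the N:M constraint."""
    return is_n_m_sparse(mask, n, m) and is_n_m_sparse(transpose(mask), n, m)


# Hypothetical 4x4 masks: the first is 2:4 along rows only,
# the second is 2:4 along both rows and columns.
row_only = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
both_ways = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

print(is_n_m_sparse(row_only, 2, 4), is_transposable_n_m(row_only, 2, 4))
print(is_n_m_sparse(both_ways, 2, 4), is_transposable_n_m(both_ways, 2, 4))
```

The first mask shows why a plain N:M mask is not enough for the backward pass: its transpose has columns with no zeros at all, so the transposed product could not use the same sparse kernel.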

Paper Structure

This paper contains 28 sections, 2 theorems, 8 equations, 7 figures, 8 tables, 1 algorithm.

Key Result

Lemma 1

Algorithm 1 produces a tight 2-approximate solution, i.e., $W(P)<2\cdot W^{*}$.
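Algorithm 1 itself is not reproduced on this page, so the following is only a sketch of a greedy selection consistent with the Figure A.1 caption. It assumes candidate entries are visited in ascending order of absolute value and that an entry is picked whenever its row or its column still has fewer than `k` picks. The function names, the brute-force comparison for `k = 1`, and the example weights are all illustrative.

```python
from itertools import permutations


def greedy_select(block, k):
    """Greedy pick of entries to prune; an entry is taken if its row OR column needs more picks."""
    m = len(block)
    # Candidate entries sorted by absolute value, smallest first.
    edges = sorted((abs(block[i][j]), i, j) for i in range(m) for j in range(m))
    row_deg, col_deg = [0] * m, [0] * m
    picked = []
    for _, i, j in edges:
        if row_deg[i] < k or col_deg[j] < k:
            picked.append((i, j))
            row_deg[i] += 1
            col_deg[j] += 1
    return picked


def optimal_cost_k1(block):
    """Brute-force optimum for k = 1: the cheapest one-per-row, one-per-column selection."""
    m = len(block)
    return min(
        sum(abs(block[i][p[i]]) for i in range(m)) for p in permutations(range(m))
    )


# Hypothetical 4x4 block: the first row and first column are slightly cheaper
# than the diagonal, which tempts the greedy pass into extra picks.
block = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.01, 5.0, 5.0],
    [1.0, 5.0, 1.01, 5.0],
    [1.0, 5.0, 5.0, 1.01],
]

picked = greedy_select(block, k=1)
greedy_cost = sum(abs(block[i][j]) for i, j in picked)
best_cost = optimal_cost_k1(block)
print(len(picked), greedy_cost, best_cost, greedy_cost < 2 * best_cost)
```

On this block the greedy pass picks seven entries (the whole first row and first column), which mirrors the seven iterations mentioned in the Figure A.1 caption, and its total stays below twice the brute-force optimum. Because a row or column can end up with more than `k` picks, this selection is not yet a valid mask; how the surplus is trimmed is not described in this excerpt.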

Figures (7)

  • Figure 1: High-level overview of the different questions and their corresponding solutions proposed in this work. Motivated by understanding fine-grained sparsity, we first suggest a measure to rank different sparsity masks, then a method to accelerate training with fine-grained sparsity, and finally a method to change the fine-grained mask without re-training.
  • Figure 3: (a): ResNet18 over Cifar100 top-1 accuracy for weight sparsity of 50% using different structured and unstructured masks. As expected, the mask diversity correlates with the pruned model accuracy. (b): Magnitude of the last layer's weight tensor of ResNet-50 (pretrained dense model) masked with structured mask 4:8, 2:4, 4:8 transposable ("4:8-T") and 4:8 sequential ("4:8-S"), normalized by the unstructured 50% sparsity ("US"). Notice that mask diversity is correlated with magnitude preservation. As expected, the 4:8 transposable mask has a similar $\ell_1$ norm score as the 2:4 mask. Additional results and details are in the appendix.
  • Figure 4: $\frac{M}{2}:M$ transposable-sparsity optimization as a min-cost flow problem. In addition to a source and a sink, the network has a node for each row and for each column. The construction uses three types of edges: (i) source edges emanating from the source node $s$ into each row node $i$; (ii) sink edges connecting each column node $j$ with the sink node $t$; and (iii) a coefficient edge $(i,j)$ for each matrix element $W_{i,j}$. Each source edge $(s,i)$ has capacity $\frac{M}{2}$, which is equal to the number of elements that need to be selected for pruning in row $i$. Similarly, each sink edge $(j,t)$ has capacity $\frac{M}{2}$, which is equal to the number of elements pruned in column $j$. Each coefficient edge $(i,j)$ has unit capacity and cost $\left| W_{i,j}\right|$. Finally, selecting a matrix element with weight $W_{i,j}$ for pruning corresponds to a unit flow on the coefficient edge $(i,j)$. Assuming the source and sink edges have zero cost, there is a one-to-one correspondence between a min-cost flow solution that sends a flow of value $\frac{M^2}{2}$ from the source $s$ to the destination $t$ in this construction, and an optimal transposable mask minimizing the sum of absolute values selected for pruning.
  • Figure 5: (a): Probability that a sampled block is $N:M$ sparse (the equation labeled eq:P_sparse in the paper) for $\rho=0.5$ and various block sizes $M$. We have a sharp ("phase") transition at $N/M = \rho$. Specifically, (i) when $N/M \leq \rho$ we have a probability larger than 0.5 that the sampled block is $N:M$ sparse; (ii) when $N/M > \rho$ this probability quickly decreases to zero. As block size $M$ increases, this phase transition gets sharper. As expected, when $M \to \infty$, unstructured sparsity satisfies the structured constraints, and we expect it to display the phase transition precisely at the critical point $\rho$. (b): Top-1 accuracy vs. percent of constraints violated. The numbers next to the baseline samples represent the sparsity level of the refined model.
  • Figure A.1: (a): Block of size $4 \times 4$ where we want to zero out one element in each row and column using the 2-approximation algorithm. (b): The block represented as a directed bipartite graph. (c): The 7 iterations of the 2-approximation algorithm on the bipartite graph. Notice that we get an approximation ratio of $\frac{7}{4}$, since the optimal solution picks only the diagonal entries.
  • ...and 2 more figures
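The network described in the Figure 4 caption can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: it builds the source, sink, and coefficient edges with the stated capacities and costs, then routes $\frac{M^2}{2}$ units of flow with a textbook successive-shortest-path solver (Bellman-Ford on the residual graph). The solver choice, the function name, and the example block are assumptions.

```python
def min_cost_prune_mask(block):
    """Return an M x M 0/1 matrix; 1 marks an entry selected for pruning."""
    m = len(block)
    half = m // 2
    s, t = 2 * m, 2 * m + 1            # row nodes 0..m-1, column nodes m..2m-1
    n_nodes = 2 * m + 2
    graph = [[] for _ in range(n_nodes)]   # adjacency lists of edge indices
    to, cap, cost = [], [], []

    def add_edge(u, v, c, w):
        # A forward edge sits at an even index e; its residual twin is e ^ 1.
        graph[u].append(len(to)); to.append(v); cap.append(c); cost.append(w)
        graph[v].append(len(to)); to.append(u); cap.append(0); cost.append(-w)

    for i in range(m):
        add_edge(s, i, half, 0)        # source edge (s, i), capacity M/2, zero cost
        add_edge(m + i, t, half, 0)    # sink edge (j, t), capacity M/2, zero cost
    coeff = {}
    for i in range(m):
        for j in range(m):
            coeff[(i, j)] = len(to)
            add_edge(i, m + j, 1, abs(block[i][j]))  # coefficient edge, unit capacity

    for _ in range(m * half):          # total flow M^2 / 2, one unit per round
        dist = [float("inf")] * n_nodes
        prev_edge = [-1] * n_nodes
        dist[s] = 0
        for _ in range(n_nodes - 1):   # Bellman-Ford handles negative residual costs
            changed = False
            for u in range(n_nodes):
                if dist[u] == float("inf"):
                    continue
                for e in graph[u]:
                    if cap[e] > 0 and dist[u] + cost[e] < dist[to[e]] - 1e-12:
                        dist[to[e]] = dist[u] + cost[e]
                        prev_edge[to[e]] = e
                        changed = True
            if not changed:
                break
        v = t
        while v != s:                  # push one unit along the cheapest path
            e = prev_edge[v]
            cap[e] -= 1
            cap[e ^ 1] += 1
            v = to[e ^ 1]

    # A saturated coefficient edge carries a unit of flow, i.e. the entry is pruned.
    return [[1 if cap[coeff[(i, j)]] == 0 else 0 for j in range(m)] for i in range(m)]


# Hypothetical 4x4 block; pruning the two smallest entries of every row
# independently would overload columns 0 and 1.
block = [
    [1, 2, 8, 9],
    [1, 3, 7, 9],
    [2, 1, 9, 6],
    [5, 4, 3, 2],
]
for row in min_cost_prune_mask(block):
    print(row)
```

The returned matrix has exactly $\frac{M}{2}$ ones in every row and every column, so both it and its transpose describe a valid $\frac{M}{2}:M$ pruning pattern. For larger blocks or many blocks, a dedicated min-cost flow library would likely be a better fit than this small solver.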

Theorems & Definitions (3)

  • Lemma
  • Lemma
  • proof