Spatial Re-parameterization for N:M Sparsity

Yuxin Zhang, Mingbao Lin, Mingliang Xu, Yonghong Tian, Rongrong Ji

TL;DR

This work addresses the performance gap between structured N:M sparsity and unstructured sparsity by revealing a key property: unstructured sparsity exhibits spatial sparsity variability that benefits feature localization. The authors propose Spatial Re-parameterization (SpRe), which adds a train-time auxiliary branch that follows the unstructured spatial sparsity distribution and can be re-parameterized into the main N:M branch for inference, preserving the N:M pattern and incurring no extra inference cost. Through BN-enabled multi-branch optimization, SpRe maintains spatial sparsity diversity during training and yields consistent accuracy gains across CIFAR-10, ImageNet-1K, and COCO tasks, even matching or surpassing some unstructured sparsity methods at comparable sparsity levels. The approach is modular and orthogonal to existing N:M methods, offering a practical path to closing the performance gap while preserving hardware-friendly acceleration on N:M sparse tensor cores.

Abstract

This paper presents Spatial Re-parameterization (SpRe), a method for N:M sparsity. SpRe stems from the observation that N:M sparsity offers far less variety in the spatial sparsity of convolution kernels than unstructured sparsity does. In particular, N:M sparsity exhibits a fixed sparsity rate across the spatial domain, because its pattern mandates N non-zero components among every M successive weights along the input channel dimension of convolution filters. In contrast, we observe that conventional unstructured sparsity displays substantial variation in sparsity across the spatial domain, which we experimentally verify to be crucial for its stronger performance retention compared with N:M sparsity. SpRe therefore borrows the spatial-sparsity distribution of unstructured sparsity by attaching an extra branch to the original N:M branch at training time, which allows the N:M sparse network to maintain a distribution of spatial sparsity similar to that of unstructured sparsity. At inference time, the extra branch is re-parameterized into the main N:M branch, without distorting the sparse pattern or adding computation cost. Across various benchmarks, SpRe brings the performance of N:M sparsity methods level with that of state-of-the-art unstructured sparsity methods. Our project is available at https://github.com/zyxxmu/SpRE.
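
The contrast the abstract draws between fixed and varying spatial sparsity can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the helper names `nm_mask`, `unstructured_mask`, and `spatial_sparsity` are hypothetical, and plain magnitude pruning is assumed as a stand-in for the unstructured methods the paper studies.

```python
import numpy as np

def nm_mask(weight, n, m):
    """Keep the n largest-magnitude weights in every group of m consecutive
    input channels. weight has shape (C_out, C_in, K_h, K_w)."""
    c_out, c_in, kh, kw = weight.shape
    assert c_in % m == 0, "C_in must be divisible by M"
    # Move C_in last so each group of m consecutive input channels is contiguous.
    w = np.abs(weight).transpose(0, 2, 3, 1).reshape(c_out, kh, kw, c_in // m, m)
    top = np.argsort(w, axis=-1)[..., -n:]          # n largest per group
    mask = np.zeros_like(w, dtype=bool)
    np.put_along_axis(mask, top, True, axis=-1)
    return mask.reshape(c_out, kh, kw, c_in).transpose(0, 3, 1, 2)

def unstructured_mask(weight, sparsity):
    """Assumed stand-in: global magnitude pruning at the given sparsity."""
    k = int(round(sparsity * weight.size))
    threshold = np.sort(np.abs(weight), axis=None)[k]
    return np.abs(weight) >= threshold

def spatial_sparsity(mask):
    """Fraction of zeros at each spatial location, measured along C_in.
    Returns an array of shape (C_out, K_h, K_w)."""
    return 1.0 - mask.mean(axis=1)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16, 3, 3))              # toy filter bank

nm = spatial_sparsity(nm_mask(w, n=1, m=4))
un = spatial_sparsity(unstructured_mask(w, sparsity=0.75))

print("1:4 pattern :", nm.min(), nm.max())          # both 0.75, i.e. (M - N) / M
print("unstructured:", un.min(), un.max())          # varies across locations
```

Both masks remove 75% of the weights overall, but only the unstructured mask lets individual spatial locations be sparser or denser than that average.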

Paper Structure (17 sections, 15 equations, 6 figures, 10 tables, 1 algorithm)

This paper contains 17 sections, 15 equations, 6 figures, 10 tables, 1 algorithm.

Figures (6)

  • Figure 1: A toy example of the discrepancy in spatial sparsity between N:M sparsity with the 1:4 pattern and 75% unstructured sparsity. We define spatial sparsity as the sparsity level at each spatial location of a convolution filter, measured along the input channel dimension $C_{i}$ of the weight matrices. (a) N:M sparsity requires N non-zero components among every M consecutive weights in the input channel dimension, resulting in equal spatial sparsity $((M-N)/M)$ at every location. (b) Unstructured sparsity removes weights at arbitrary locations, resulting in uneven spatial sparsity.
  • Figure 2: Spatial sparsity of common unstructured sparsity methods, including magnitude-based sparsity han2015learning, RigL evci2020rigging, and GraNet liu2021sparse. We show the spatial sparsity of 3$\times$3 kernels from different layers of ResNet-50 he2016deep at 95% overall sparsity, close to the sparsity level of the 1:16 pattern. Experiments are performed on ImageNet-1K deng2009imagenet.
  • Figure 3: Performance comparison for pruning ResNet-32 he2016deep on CIFAR-10 krizhevsky2009learning and ResNet-50 he2016deep on ImageNet-1K deng2009imagenet. The compared methods are the dense version of ResNet-50 (Baseline), SR-STE zhou2021learning (N:M sparsity), unstructured sparse GraNet liu2021sparse, and GraNet under two types of sparsity constraints: (1) the same spatial sparsity at every spatial location (GraNet$^{\dag}$), and (2) the same mask flexibility as (1) but no spatial sparsity constraint (GraNet$^{\ddag}$). Performance drops when an unstructured sparse method is constrained to uniform spatial sparsity, even if the mask flexibility remains the same.
  • Figure 4: Framework of SpRe. (a) The uneven spatial sparsity arising from unstructured sparse weights is used to build an extra branch that compensates for the spatial sparsity of the N:M mask (Eq. \ref{eq:spatial_mask}). (b) The extra branch is then trained jointly with the main N:M branch to mitigate the performance drop caused by the spatial sparsity gap between unstructured sparsity and N:M sparsity. After training, the two branches are merged by re-parameterization into a single N:M sparse branch, without altering the output.
  • Figure 5: Loss curves when training a 1:16 sparse ResNet-50 with different methods on ImageNet-1K. The final Top-1 accuracies are 72.3%, 71.5%, and 71.8% for ASP w. SpRe (120 epochs), ASP (120 epochs), and ASP (180 epochs), respectively.
  • ...and 1 more figures
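
The merge described in the Figure 4 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and it rests on two assumptions that do not come from the text above: each branch is a convolution followed by its own batch normalization (BN) that is folded into a per-channel scale and shift at inference, and the extra branch's mask is a subset of the main N:M mask (here obtained by zeroing random spatial locations, a placeholder for the paper's spatial mask equation). All function and variable names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d(x, w):
    """Valid, stride-1 convolution. x: (N, C_in, H, W), w: (C_out, C_in, K, K)."""
    patches = sliding_window_view(x, w.shape[2:], axis=(2, 3))
    return np.einsum("nchwij,ocij->nohw", patches, w)

def bn_affine(gamma, beta, mean, var, eps=1e-5):
    """Inference-time BN expressed as a per-channel scale and shift."""
    scale = gamma / np.sqrt(var + eps)
    return scale, beta - mean * scale

def bn(y, params):
    scale, shift = bn_affine(*params)
    return y * scale[None, :, None, None] + shift[None, :, None, None]

def merge_branches(w_main, bn_main, w_extra, bn_extra):
    """Fold both BNs into their convolutions and sum them into one kernel + bias."""
    s1, t1 = bn_affine(*bn_main)
    s2, t2 = bn_affine(*bn_extra)
    w = s1[:, None, None, None] * w_main + s2[:, None, None, None] * w_extra
    return w, t1 + t2

rng = np.random.default_rng(0)
c_out, c_in, k, m = 4, 8, 3, 4

# 1:4 mask: keep one (random) position in every group of 4 consecutive input channels.
idx = rng.integers(0, m, size=(c_out, c_in // m, 1, k, k))
nm_mask = (np.arange(m).reshape(1, 1, m, 1, 1) == idx).reshape(c_out, c_in, k, k)

# Assumed extra-branch mask: the N:M mask with some spatial locations zeroed out.
spatial_keep = rng.random((c_out, 1, k, k)) < 0.5
extra_mask = nm_mask & spatial_keep

w_main = rng.standard_normal((c_out, c_in, k, k)) * nm_mask
w_extra = rng.standard_normal((c_out, c_in, k, k)) * extra_mask
bn_main = tuple(rng.random((4, c_out)) + 0.5)    # gamma, beta, mean, var
bn_extra = tuple(rng.random((4, c_out)) + 0.5)

x = rng.standard_normal((2, c_in, 6, 6))
y_train = bn(conv2d(x, w_main), bn_main) + bn(conv2d(x, w_extra), bn_extra)

w_merged, b_merged = merge_branches(w_main, bn_main, w_extra, bn_extra)
y_infer = conv2d(x, w_merged) + b_merged[None, :, None, None]

print("outputs match:", np.allclose(y_train, y_infer))
print("merged kernel stays inside N:M mask:", not np.any((w_merged != 0) & ~nm_mask))
```

Because convolution is linear and the folded BN is a per-channel affine map, the two branches sum into a single kernel and bias. Under the subset assumption, every non-zero of the merged kernel lies inside the main N:M mask, so the pattern is preserved.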