Progressive Gradient Flow for Robust N:M Sparsity Training in Transformers

Abhimanyu Rajeshkumar Bambhaniya, Amir Yazdanbakhsh, Suvinay Subramanian, Sheng-Chun Kao, Shivani Agrawal, Utku Evci, Tushar Krishna

TL;DR

This work tackles the challenge of maintaining model quality under very high N:M sparsity in Transformers. It identifies gradient noise in sparse training as a major bottleneck and introduces decaying-gradient-flow recipes (MdGf and SdGf) that progressively limit gradient flow to pruned weights while preserving early training dynamics. The proposed methods achieve consistent accuracy gains (up to ~2% in vision and ~5% in language tasks) and reduce training and inference FLOPs, with MdGf-Exponential yielding near-dense performance at extreme sparsity and substantial computational savings. The results across ViT, SwinV2, and T5X-Base demonstrate the practical impact of gradient-flow control for scalable, hardware-friendly N:M sparsity in large-scale transformers.

Abstract

N:M structured sparsity has garnered significant interest as a result of its relatively modest overhead and improved efficiency. Additionally, this form of sparsity holds considerable appeal for reducing the memory footprint owing to its modest representation overhead. Although there have been efforts to develop training recipes for N:M structured sparsity, they primarily focus on low-sparsity regions (~50%). The performance of models trained using these approaches tends to decline in high-sparsity regions (>80%). In this work, we study the effectiveness of existing sparse training recipes at high-sparsity regions and argue that these methods fail to sustain model quality on par with low-sparsity regions. We demonstrate that the significant factor contributing to this disparity is the presence of elevated levels of induced noise in the gradient magnitudes. To mitigate this undesirable effect, we employ decay mechanisms to progressively restrict the flow of gradients towards pruned elements. Our approach improves model quality by up to 2% and 5% in vision and language models, respectively, in the high-sparsity regime. We also evaluate the trade-off between model accuracy and training compute cost in terms of FLOPs. At iso-training FLOPs, our method yields better performance than conventional sparse training recipes, exhibiting an accuracy improvement of up to 2%. The source code is available at https://github.com/abhibambhaniya/progressive_gradient_flow_nm_sparsity.
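
To make the idea of N:M sparsity with a decaying mask concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (`nm_mask`, `decayed_mask`, `nm_sparsity`) are hypothetical, magnitude-based selection within each group is assumed, and a linear decay schedule is assumed for the mask value of pruned weights. If the forward pass uses `weight * mask_value`, the gradient reaching a pruned weight is scaled by the same mask value, so decaying it from 1 to 0 progressively restricts gradient flow to pruned elements.

```python
from typing import List


def nm_sparsity(n: int, m: int) -> float:
    """Fraction of weights removed when n out of every m are kept."""
    return 1.0 - n / m


def nm_mask(weights: List[float], n: int, m: int) -> List[int]:
    """Binary mask keeping the n largest-magnitude entries in each group of m.

    Magnitude-based selection is an assumption made for this sketch.
    """
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    mask = [0] * len(weights)
    for start in range(0, len(weights), m):
        group = range(start, start + m)
        keep = sorted(group, key=lambda i: abs(weights[i]), reverse=True)[:n]
        for i in keep:
            mask[i] = 1
    return mask


def decayed_mask(mask: List[int], step: int, total_steps: int) -> List[float]:
    """Soft mask: kept entries stay at 1.0, pruned entries decay from 1.0 to 0.0.

    The linear schedule here is an assumption chosen for simplicity.
    """
    beta = max(0.0, 1.0 - step / total_steps)
    return [1.0 if kept else beta for kept in mask]


weights = [0.9, -0.1, 0.3, -0.7, 0.05, 0.6, -0.2, 0.4]
mask = nm_mask(weights, n=1, m=4)          # 1:4 pattern, 75% sparsity
halfway = decayed_mask(mask, step=5, total_steps=10)
final = decayed_mask(mask, step=10, total_steps=10)
```

At step 0 every mask value is 1.0, so training starts out dense; by the final step the soft mask equals the binary N:M mask. For the 1:16 pattern used in the ViT-tiny experiments, `nm_sparsity(1, 16)` gives 0.9375, i.e. 93.75% sparsity.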

Paper Structure (67 sections, 4 equations, 14 figures, 34 tables)

Figures (14)

  • Figure 2: An overview of different sparse training recipes: (a) SR-STE; (b, c) the decaying mechanisms proposed in this work. (b) shows decaying binary mask values for pruned weights (MdGf), whereas (c) gradually changes the N:M sparsity patterns at different intervals (SdGf).
  • Figure 3: Trends for different indicators of gradient values during training. Data from ViT-tiny trained on CIFAR-10 with 1:16 sparsity pattern. (a) and (b) show the running average of the variance of AdamW second moment and gradient variance, respectively.
  • Figure 4: FLOP vs. Accuracy for ViT-Base+ImageNet-1K.
  • Figure 5: ViT-Base trained on ImageNet-1K with different sparsity patterns and targets. (a) shows the Occam's hill, where sparsity improves the model accuracy. The dashed red line shows the reduction in inference FLOPs at different sparsity ratios. In the high-sparsity regime (>80%), MdGf yields better accuracy than SR-STE. (b) shows model accuracy across training recipes (dense and sparse) at different training FLOPs. The vertical line indicates that the proposed decaying method is better (1.6%) than the dense model at the given training FLOPs. The vertical line shows that the decaying-based method reaches dense model accuracy with 37.8% fewer training FLOPs.
  • Figure 6: Training Epochs vs Accuracy graph for different sparsity targets. We train ViT-Base on ImageNet-1K.
  • ...and 9 more figures