Feather: An Elegant Solution to Effective DNN Sparsification

Athanasios Glentis Georgoulakis, George Retsinas, Petros Maragos

TL;DR

Feather addresses the challenge of training-time DNN sparsification by enhancing STE-based pruning with a novel forward thresholding operator and a gradient-scaling mechanism. It introduces a parametric thresholding family with $p$, selects $p=3$ to balance continuity and bias, and employs a gradient scale $\theta$ that adapts to target sparsity, enabling stable, highly sparse models. Across CIFAR-100 and ImageNet, Feather delivers state-of-the-art results at extreme sparsities and improves performance over prior STE-based methods like ST-3 and Spartan, with minimal training overhead. The framework is versatile, compatible with global and layer-wise pruning backbones, and offers a practical path toward FLOPs-efficient sparse networks for resource-constrained deployments.

Abstract

Neural network pruning is an increasingly popular way of producing compact and efficient models that suit resource-limited environments while preserving high performance. Although pruning can be performed with a multi-cycle training and fine-tuning process, the recent trend is to fold sparsification into the standard course of training. To this end, we introduce Feather, an efficient sparse training module that uses the powerful Straight-Through Estimator (STE) at its core, coupled with a new thresholding operator and a gradient scaling technique, enabling robust, out-of-the-box sparsification performance. Feather's effectiveness and adaptability are demonstrated using various architectures on the CIFAR dataset, while on ImageNet it achieves state-of-the-art Top-1 validation accuracy using the ResNet-50 architecture, surpassing existing methods, including more complex and computationally heavy ones, by a considerable margin. Code is publicly available at https://github.com/athglentis/feather.

Paper Structure (18 sections, 3 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: (a) The proposed sparse training block, utilizing the new thresholding operator and the gradient scaling mask; (b) the proposed family of thresholding operators for varying values of $p$. We adopt $p=3$, which strikes a fine balance between the two extremes, hard and soft thresholding.
  • Figure 2: Study of the effect of the thresholding operator on the final sparse model accuracy. The proposed threshold steadily outperforms the hard and soft operators.
  • Figure 3: Study of the effect of gradient scaling. Under conservative final sparsity, $\theta$ near unity is preferable, while when targeting high sparsity, models benefit from $\theta$ near the middle of its range.
  • Figure 4: Gradient scaling improves the final accuracy at high sparsity regardless of the thresholding operator, while maximum performance is achieved when it is combined with the proposed threshold.
  • Figure 5: A study of the effect of the $p$ value of the proposed family of thresholds on the final sparse model accuracy. Results from ResNet-20 trained on CIFAR-100 (a) and the corresponding thresholds used (b).
  • ...and 3 more figures