Dual sparse training framework: inducing activation map sparsity via Transformed $\ell1$ regularization

Xiaolong Yu, Cong Tian

TL;DR

This work addresses the efficiency gap of CNNs on resource-constrained devices by inducing activation-map sparsity with a Transformed $\ell1$ regularizer ($T\ell1$) and pairing it with weight pruning in a three-stage dual sparse training framework. The regularizer is defined as $T\ell1(\mathbf{x})=\sum_{i}\frac{(1+\beta)|x_i|}{\beta+|x_i|}$ with $\beta>0$; it interpolates between $\|\mathbf{x}\|_0$ (as $\beta\to 0^+$) and $\|\mathbf{x}\|_1$ (as $\beta\to\infty$), and the authors report smoother convergence and better sparsity control than with traditional regularizers. The framework first trains a dense model, then sparsifies the weights via $\ell1$ regularization and threshold pruning, and finally applies $T\ell1$ regularization to induce activation-map sparsity, yielding balanced FLOPs reductions across layers. Experiments on MNIST, CIFAR-100, and ImageNet with LeNet5, GoogLeNet, DenseNet121, and ResNet variants show activation-sparsity gains of more than 20% in most cases (e.g., 27.52% for ResNet18 on ImageNet and 44.04% for LeNet5 on MNIST) and substantial FLOPs reductions (e.g., 81.7% for ResNet18 and 84.13% for ResNet50) without compromising accuracy. Overall, the approach enables more memory- and compute-efficient inference suited to on-device deployment and to accelerators that exploit sparsity.
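To make the interpolation property concrete, the following is a minimal sketch of the $T\ell1$ penalty in plain Python, written directly from the formula above. The function name `transformed_l1` and the sample values are illustrative assumptions, not taken from the paper's code.

```python
def transformed_l1(xs, beta):
    """T-l1 penalty: sum over entries of (1 + beta) * |x| / (beta + |x|), with beta > 0."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    return sum((1.0 + beta) * abs(x) / (beta + abs(x)) for x in xs)


# Hypothetical activation values, chosen only for illustration.
activations = [0.0, 0.5, -2.0, 0.0, 3.0]

l0 = sum(1 for x in activations if x != 0)  # number of nonzero entries
l1 = sum(abs(x) for x in activations)       # sum of absolute values

small_beta = transformed_l1(activations, 1e-6)  # approaches the l0 count
large_beta = transformed_l1(activations, 1e6)   # approaches the l1 norm

print(f"l0={l0}, T-l1(beta=1e-6)={small_beta:.6f}")
print(f"l1={l1}, T-l1(beta=1e6)={large_beta:.6f}")
```

With a very small $\beta$ each nonzero entry contributes close to 1, so the penalty behaves like a count of nonzeros; with a very large $\beta$ each entry contributes close to its magnitude.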

Abstract

Although deep convolutional neural networks have developed rapidly, computational and storage limitations make it challenging to deploy these models widely on low-power devices. To address this issue, researchers have proposed techniques such as model compression, activation sparsity induction, and hardware accelerators. This paper presents a method that induces sparsity in activation maps using Transformed $\ell1$ regularization, advancing research on activation sparsity induction. The method is further combined with traditional pruning to form a dual sparse training framework. Compared to previous methods, Transformed $\ell1$ achieves higher sparsity and adapts better to different network structures. Experimental results show that the method improves activation map sparsity by more than 20% on most models and corresponding datasets without compromising accuracy. Specifically, it achieves a 27.52% improvement for ResNet18 on the ImageNet dataset and a 44.04% improvement for LeNet5 on the MNIST dataset. In addition, the dual sparse training framework greatly reduces the computational load and offers potential for reducing the storage required at runtime. Specifically, the ResNet18 and ResNet50 models obtained with the dual sparse training framework reduce multiplicative floating-point operations by 81.7% and 84.13%, respectively, while maintaining accuracy and a low pruning rate.
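The toy sketch below illustrates why sparsity in both weights and activation maps reduces multiplications: a product can be skipped whenever either operand is zero. It is a simplified counting model on a single dot product, not the paper's FLOPs accounting, and the helper names, threshold, and values are all assumptions made for illustration.

```python
def prune_by_threshold(weights, threshold):
    """Zero out weights whose magnitude falls below the threshold."""
    return [0.0 if abs(w) < threshold else w for w in weights]


def relu(values):
    """ReLU produces exact zeros, which is what makes activation maps sparse."""
    return [v if v > 0 else 0.0 for v in values]


def sparsity(values):
    """Fraction of entries that are exactly zero."""
    return sum(1 for v in values if v == 0) / len(values)


# Hypothetical weights and pre-activation inputs for one dot product.
weights = [0.8, -0.01, 0.3, 0.002, -0.5, 0.04]
pre_activations = [1.2, -0.7, -0.1, 0.9, 2.0, -3.0]

pruned = prune_by_threshold(weights, threshold=0.05)
acts = relu(pre_activations)

# A multiplication is only needed where both the weight and the activation are nonzero.
needed = sum(1 for w, a in zip(pruned, acts) if w != 0 and a != 0)
reduction = 1 - needed / len(weights)

print(f"weight sparsity:     {sparsity(pruned):.2f}")
print(f"activation sparsity: {sparsity(acts):.2f}")
print(f"multiplications:     {needed} of {len(weights)} ({reduction:.0%} skipped)")
```

In this toy case the two kinds of zeros only partly overlap, so the combined saving exceeds what either weight sparsity or activation sparsity gives alone.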

Paper Structure

This paper contains 6 sections, 7 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: The sparsity of activation maps and weights for each layer of ResNet18
  • Figure 2: Dual sparse training framework
  • Figure 3: The performance of $T\ell1$ (left), $\ell1$ (center), and square Hoyer (right) regularizers on LeNet5, DenseNet121, ResNet34, and ResNet50, respectively.
  • Figure 4: The impact of different values of $\beta$ on Top-1 accuracy and activation map sparsity
  • Figure 5: The percentage of FLOPs reduction per layer for ResNet18/50