Auto-Train-Once: Controller Network Guided Automatic Network Pruning from Scratch

Xidong Wu, Shangqian Gao, Zeyu Zhang, Zhenzhen Li, Runxue Bao, Yanfu Zhang, Xiaoqian Wang, Heng Huang

TL;DR

This paper proposes Auto-Train-Once (ATO), an innovative network pruning algorithm designed to automatically reduce the computational and storage costs of DNNs, and develops a novel stochastic gradient algorithm that enhances the coordination between model training and controller network training, thereby improving pruning performance.

Abstract

Current techniques for deep neural network (DNN) pruning often involve intricate multi-step processes that require domain-specific expertise, making their widespread adoption challenging. To address this limitation, Only-Train-Once (OTO) and OTOv2 were proposed to eliminate the need for additional fine-tuning steps by directly training and compressing a general DNN from scratch. Nevertheless, the static design of the optimizers (in OTO) can lead to convergence issues such as getting stuck in local optima. In this paper, we propose Auto-Train-Once (ATO), an innovative network pruning algorithm designed to automatically reduce the computational and storage costs of DNNs. During the model training phase, our approach not only trains the target model but also leverages a controller network as an architecture generator to guide the learning of the target model weights. Furthermore, we develop a novel stochastic gradient algorithm that enhances the coordination between model training and controller network training, thereby improving pruning performance. We provide a comprehensive convergence analysis as well as extensive experiments, and the results show that our approach achieves state-of-the-art performance across various model architectures (including ResNet18, ResNet34, ResNet50, ResNet56, and MobileNetv2) on standard benchmark datasets (CIFAR-10, CIFAR-100, and ImageNet).

Paper Structure

This paper contains 18 sections, 6 theorems, 47 equations, 4 figures, 5 tables, 3 algorithms.

Key Result

Theorem 1

Assume that the sequence $\{z_t\}_{t=1}^T$ is generated by Algorithm ATO (the variables are defined in the supplementary materials). With hyperparameters $\eta_t = \frac{\hat{c}}{(\bar{c}+t)^{1/2}}$, $\frac{\hat{c}}{\bar{c}^{1/2}} \leq \min \{1, \frac{\epsilon}{4 \cdots}\}$ [...] where $G = \frac{4 (\mathcal{J}(z_1) - \mathcal{J}(z^*))}{\epsilon \hat{c}} + \frac{2\sigma^2}{b L \cdots}$ [...]
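
The decaying step size $\eta_t = \hat{c} / (\bar{c}+t)^{1/2}$ in Theorem 1 can be illustrated with a minimal sketch. The function name `step_size` and the constants below are hypothetical choices for illustration, not values from the paper, and only the part of the hyperparameter condition that is legible above ($\hat{c}/\bar{c}^{1/2} \leq 1$) is checked.

```python
import math


def step_size(t: int, c_hat: float, c_bar: float) -> float:
    """Return eta_t = c_hat / sqrt(c_bar + t) for iteration t = 1, 2, ..."""
    if t < 1:
        raise ValueError("t must be a positive iteration index")
    if c_hat <= 0 or c_bar <= 0:
        raise ValueError("c_hat and c_bar must be positive")
    return c_hat / math.sqrt(c_bar + t)


# Hypothetical constants, chosen only so that the numbers are easy to read.
c_hat, c_bar = 0.1, 15.0

# Only the legible part of the condition in Theorem 1 is checked here:
# c_hat / sqrt(c_bar) <= 1.
assert c_hat / math.sqrt(c_bar) <= 1.0

# Step sizes for the first five iterations; they decay like 1 / sqrt(t).
schedule = [step_size(t, c_hat, c_bar) for t in range(1, 6)]
for t, eta in enumerate(schedule, start=1):
    print(f"t={t}: eta_t={eta:.5f}")
```

With this form, $\bar{c}$ caps the initial step at $\hat{c}/(\bar{c}+1)^{1/2}$, while for large $t$ the step shrinks at the familiar $t^{-1/2}$ rate of stochastic gradient methods.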

Figures (4)

  • Figure 1: Overview of Auto-Train-Once (ATO). The controller network generates a mask $\mathbf{w}$ based on the size of the zero-invariant groups (ZIGs) $\mathcal{G}$ to guide automatic pruning of the target model; after training, variable groups are removed according to $\mathbf{w}$. No additional training (such as fine-tuning) is required after model training, so the final compressed model is obtained directly.
  • Figure 2: (a, e): the impact of $\lambda$ in the regularization term in Eq. \ref{eq:1}. (b, f): the effect of hyperparameter $\gamma$ in $\mathcal{R}_{\text{FLOPs}}$ in Eq. \ref{eq:3}. (c, g): the effect of $T_{\text{w}}$. (d, h): the effect of the projection operation as in Eq. \ref{eq:2_1}. Experiments are conducted on CIFAR-10 with ResNet-56 and $p=0.45$ (a, b, c, d) and $p=0.35$ (e, f, g, h).
  • Figure 3: ResNet50 on ImageNet
  • Figure 4: The effect of hyperparameter $\gamma$ in $\mathcal{R}_{\text{FLOPs}}$ in Eq. \ref{eq:3}. Experiments are conducted on CIFAR-10 with ResNet-56 with $p=0.45$ (a) and $p=0.35$ (b).
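
The Figure 1 caption describes removing variable groups according to a mask $\mathbf{w}$ without further fine-tuning. The sketch below illustrates that idea on a toy two-layer ReLU network in plain Python. It is not the authors' implementation: the mask is hard-coded rather than produced by a controller network, each hidden unit is treated as one group, biases are omitted, and all function names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matvec(W: Matrix, x: List[float]) -> List[float]:
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v: List[float]) -> List[float]:
    return [max(0.0, a) for a in v]


def forward(W1: Matrix, W2: Matrix, x: List[float]) -> List[float]:
    """Plain two-layer network: W2 @ relu(W1 @ x)."""
    return matvec(W2, relu(matvec(W1, x)))


def forward_masked(W1: Matrix, W2: Matrix, mask: List[int],
                   x: List[float]) -> List[float]:
    """Same network, but hidden units with mask 0 are zeroed out."""
    h = relu(matvec(W1, x))
    h = [m * a for m, a in zip(mask, h)]
    return matvec(W2, h)


def prune(W1: Matrix, W2: Matrix, mask: List[int]) -> Tuple[Matrix, Matrix]:
    """Drop each masked hidden unit: its row in W1 and its column in W2."""
    keep = [i for i, m in enumerate(mask) if m == 1]
    W1_pruned = [W1[i] for i in keep]
    W2_pruned = [[row[i] for i in keep] for row in W2]
    return W1_pruned, W2_pruned


# Toy weights: 2 inputs, 3 hidden units, 1 output.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W2 = [[1.0, 2.0, 3.0]]
mask = [1, 0, 1]  # assumed mask: the second hidden unit is pruned
x = [1.0, 2.0]

W1_p, W2_p = prune(W1, W2, mask)
out_masked = forward_masked(W1, W2, mask, x)
out_pruned = forward(W1_p, W2_p, x)
print(out_masked, out_pruned)
```

Because a masked hidden unit contributes exactly zero to the next layer, the physically smaller network produces the same output as the masked one, which is why the masked groups can be removed after training without changing the model's predictions.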

Theorems & Definitions (12)

  • Definition 1
  • Theorem 1
  • Remark 1
  • Lemma 1
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • ...and 2 more