AMC: AutoML for Model Compression and Acceleration on Mobile Devices

Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, Song Han

TL;DR

This paper proposes AutoML for Model Compression (AMC), which uses reinforcement learning to efficiently sample the design space. AMC improves compression quality and achieves state-of-the-art model compression results in a fully automated way, without human effort.

Abstract

Model compression is a critical technique for efficiently deploying neural network models on mobile devices, which have limited computation resources and tight power budgets. Conventional model compression techniques rely on hand-crafted heuristics and rule-based policies that require domain experts to explore a large design space, trading off model size, speed, and accuracy; this process is usually sub-optimal and time-consuming. In this paper, we propose AutoML for Model Compression (AMC), which leverages reinforcement learning to provide the model compression policy. This learning-based compression policy outperforms conventional rule-based compression policies by achieving a higher compression ratio, better preserving accuracy, and freeing human labor. Under 4x FLOPs reduction, we achieved 2.7% better accuracy than the hand-crafted model compression policy for VGG-16 on ImageNet. We applied this automated, push-the-button compression pipeline to MobileNet and achieved a 1.81x speedup of measured inference latency on an Android phone and a 1.43x speedup on the Titan XP GPU, with only 0.1% loss of ImageNet Top-1 accuracy.

Paper Structure

This paper contains 22 sections, 5 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of the AutoML for Model Compression (AMC) engine. Left: AMC replaces the human expert and makes model compression fully automated while performing better than the human expert. Right: AMC formulated as a reinforcement learning problem. We process a pre-trained network (e.g., MobileNet) in a layer-by-layer manner. Our reinforcement learning agent (DDPG) receives the embedding $s_t$ from a layer $t$ and outputs a sparsity ratio $a_t$. After the layer is compressed with $a_t$, the agent moves to the next layer $L_{t+1}$. Once all layers are compressed, the accuracy of the pruned model is evaluated. Finally, the reward $R$, a function of accuracy and FLOPs, is returned to the reinforcement learning agent.
  • Figure 2: Comparison of pruning strategies for Plain-20 under $2\times$. The uniform policy sets the same compression ratio for every layer. The shallow and deep policies aggressively prune shallow and deep layers, respectively. The policy given by AMC has a sawtooth shape, which resembles the bottleneck architecture (He et al., 2016). The policy given by AMC achieves higher accuracy than the hand-crafted policies. (Better viewed in color.)
  • Figure 3: The pruning policy (sparsity ratio) given by our reinforcement learning agent for ResNet-50. With 4 stages of iterative pruning, we find a very salient sparsity pattern across layers: peaks are $1\times1$ convolutions and troughs are $3\times3$ convolutions. The reinforcement learning agent automatically learns that $3\times3$ convolutions have more redundancy than $1\times1$ convolutions and can be pruned more.
  • Figure 4: Our reinforcement learning agent (AMC) can prune the model to a lower density than human experts without losing accuracy. (Human expert: $3.4\times$ compression on ResNet-50. AMC: $5\times$ compression on ResNet-50.)
  • Figure 5: (a) Accuracy versus MAC trade-off for AMC, the human expert, and unpruned MobileNet. AMC strictly dominates the human expert on the Pareto-optimal curve. (b) Accuracy versus latency trade-off for AMC, NetAdapt, and unpruned MobileNet. AMC significantly improves the Pareto curve of MobileNet. Reinforcement-learning-based AMC surpasses heuristic-based NetAdapt on the Pareto curve (inference time for both measured on a Google Pixel 1).
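
The layer-by-layer loop described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Layer`, `RandomAgent`, `toy_accuracy`, and `run_episode` are hypothetical names, the random agent stands in for DDPG, the accuracy function is a made-up proxy rather than a real evaluation, the per-layer FLOPs accounting ignores how pruning one layer affects its neighbors, and the reward shown is only one possible function of accuracy and FLOPs.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Layer:
    """Toy stand-in for one prunable layer (hypothetical structure)."""
    name: str
    flops: float  # FLOPs of the uncompressed layer


class RandomAgent:
    """Placeholder for the DDPG agent: samples a sparsity ratio a_t at random."""

    def __init__(self, max_sparsity=0.8, seed=0):
        self.max_sparsity = max_sparsity
        self._rng = random.Random(seed)

    def act(self, state):
        # A trained agent would map the embedding s_t to a_t; this ignores it.
        return self._rng.uniform(0.0, self.max_sparsity)


def toy_accuracy(policy):
    """Made-up accuracy proxy (assumption): accuracy falls as mean sparsity grows."""
    mean_sparsity = sum(policy) / len(policy)
    return max(0.0, 0.9 - 0.5 * mean_sparsity ** 2)


def run_episode(layers, agent):
    """Walk the network layer by layer, then score the fully compressed model."""
    policy = []
    kept_flops = 0.0
    for t, layer in enumerate(layers):
        remaining = sum(later.flops for later in layers[t + 1:])
        s_t = (t, layer.flops, kept_flops, remaining)  # hypothetical embedding
        a_t = agent.act(s_t)                           # sparsity ratio for layer t
        policy.append(a_t)
        kept_flops += layer.flops * (1.0 - a_t)        # simplified FLOPs accounting
    accuracy = toy_accuracy(policy)
    # Assumed reward: penalize error, scaled by the log of the remaining FLOPs.
    reward = -(1.0 - accuracy) * math.log(kept_flops)
    return policy, accuracy, kept_flops, reward


layers = [Layer(f"conv{i}", flops) for i, flops in enumerate([2e6, 4e6, 4e6, 1e6])]
policy, accuracy, kept_flops, reward = run_episode(layers, RandomAgent())
print([round(a, 2) for a in policy], round(accuracy, 3), round(reward, 2))
```

The reward is computed only once per episode, after every layer has received its ratio, which mirrors the caption's description of evaluating the pruned model with all layers compressed. In a real setup, the random agent would be replaced by a learned policy that is updated from the returned reward.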