EPSD: Early Pruning with Self-Distillation for Efficient Model Compression

Dong Chen; Ning Liu; Yichen Zhu; Zhengping Che; Rui Ma; Fachao Zhang; Xiaofeng Mou; Yi Chang; Jian Tang

TL;DR

The paper addresses the inefficiency of combining pruning with knowledge distillation by introducing EPSD, a two-step framework that prunes an initialized network to retain distillable weights and then trains the pruned model with self-distillation. By evaluating the SD loss during pruning, EPSD identifies weights whose removal least disrupts SD guidance, enabling a pruning mask that preserves SD trainability. Across CIFAR-10/100, Tiny-ImageNet, ImageNet, and downstream tasks, EPSD equipped with multiple SD methods consistently outperforms the simple prune-then-distill baseline and competitive pruning/SD approaches, while substantially reducing the training burden (no pre-trained teacher required). The approach demonstrates strong efficiency and robustness in vision benchmarks and suggests potential for extension to larger multi-modal or language models, highlighting a practical route toward deploying efficient compressed models on edge devices.
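The mask selection described above can be sketched in a few lines. This is only an illustration under stated assumptions: the summary says weights are scored by the SD loss but does not give the scoring rule, so the SNIP-style score |w * g| used here is an assumption, and `saliency_mask`, `weights`, and `sd_grads` are hypothetical names with made-up values.

```python
def saliency_mask(weights, grads, sparsity):
    """Return a 0/1 mask keeping the top (1 - sparsity) fraction of weights.

    Scores are |w * g|, where g stands for the gradient of a
    self-distillation loss with respect to each weight. The exact score
    used by EPSD is not given in this summary; this rule is an assumption.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    scores = [abs(w * g) for w, g in zip(weights, grads)]
    n_keep = max(1, round(len(scores) * (1.0 - sparsity)))
    # Rank weight indices from most to least salient.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [1 if i in keep else 0 for i in range(len(scores))]


# Toy values (hypothetical): six weights and their SD-loss gradients.
weights = [0.5, -1.2, 0.05, 0.8, -0.3, 0.9]
sd_grads = [0.1, 0.4, -2.0, 0.0, 0.6, -0.3]

mask = saliency_mask(weights, sd_grads, sparsity=0.5)
print(mask)  # -> [0, 1, 0, 0, 1, 1]
```

In a real network the gradients would come from backpropagating the SD loss through the initialized model, and the mask would then be fixed before the SD training step.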

Abstract

Neural network compression techniques, such as knowledge distillation (KD) and network pruning, have received increasing attention. The recent work 'Prune, then Distill' reveals that a pruned, student-friendly teacher network can benefit the performance of KD. However, the conventional teacher-student pipeline, which entails cumbersome pre-training of the teacher and complicated compression steps, makes pruning with KD less efficient. Beyond compressing models, recent compression techniques also emphasize efficiency. Early pruning demands significantly less computation than conventional pruning methods because it does not require a large pre-trained model. Likewise, a special case of KD known as self-distillation (SD) is more efficient since it requires no pre-training or student-teacher pair selection. This inspires us to combine early pruning with SD for efficient model compression. In this work, we propose a framework named Early Pruning with Self-Distillation (EPSD), which identifies and preserves distillable weights in early pruning for a given SD task. EPSD efficiently combines early pruning and self-distillation in a two-step process, maintaining the pruned network's trainability for compression. Instead of a simple combination of pruning and SD, EPSD enables the pruned network to favor SD by keeping more distillable weights before training, ensuring better distillation of the pruned network. We demonstrate that EPSD improves the training of pruned networks, supported by visual and quantitative analyses. Our evaluation covers diverse benchmarks (CIFAR-10/100, Tiny-ImageNet, full ImageNet, CUB-200-2011, and Pascal VOC), with EPSD outperforming advanced pruning and SD techniques.
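To make the self-distillation objective concrete, the sketch below shows one generic form of an SD loss: a cross-entropy term on the label plus a temperature-softened KL term against the network's own earlier or auxiliary predictions. The abstract does not specify which SD losses EPSD uses, so this formulation, the function names, and the `alpha` and `temperature` defaults are all assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sd_loss(student_logits, self_teacher_logits, label,
            alpha=0.5, temperature=2.0):
    """Generic self-distillation loss (an assumed form, not the paper's).

    (1 - alpha) * CE(student, label)
      + alpha * T^2 * KL(self_teacher_T || student_T)

    The 'self teacher' logits are produced by the same network, for
    example from a previous epoch or another view of the input, so no
    separately pre-trained teacher is needed.
    """
    probs = softmax(student_logits)
    ce = -math.log(probs[label])
    q_teacher = softmax(self_teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s)
             for t, s in zip(q_teacher, q_student) if t > 0.0)
    return (1.0 - alpha) * ce + alpha * temperature ** 2 * kl


# Toy logits (hypothetical) for a 3-class problem with true class 0.
same = sd_loss([2.0, 0.0, -1.0], [2.0, 0.0, -1.0], label=0)
different = sd_loss([2.0, 0.0, -1.0], [0.0, 2.0, -1.0], label=0)
print(same < different)  # -> True
```

When the self-teacher agrees exactly with the student, the KL term vanishes and only the weighted cross-entropy remains; disagreement adds a positive penalty. In the two-step pipeline, a loss of this kind would be used both to score weights before pruning and to train the pruned network afterwards.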

Paper Structure (32 sections, 12 equations, 11 figures, 17 tables, 1 algorithm)

Introduction
Related Works
Early Pruning with Self-Distillation
The 'Simple Combination'
Identify Distillable Weights via SD
Towards Efficient Model Compression
Experiments
EPSD equipped with Various SD Methods
Comparison of Pruning Methods
Comparison of SD Methods
Impact of SD-based Pre-training
Downstream Tasks
Discussion and Limitation
Conclusion
Datasets and Networks
...and 17 more sections

Figures (11)

Figure 1: Comparison of different model compression schemes. (a) PKD [park2021prune] follows four steps to combine pruning and KD. (b) Our Early Pruning with SD (EPSD) needs only two steps for compression.
Figure 2: Performance comparison on CIFAR-100 with ResNet-18 among the 'Simple Combination', the pre-trained network ('Unpruned Baseline'), a network that is only pruned, without SD fine-tuning ('Pruning Only'), and a network trained with SD only, without any sparsity ('SD Only'). The 'Simple Combination' suffers severe performance degradation, especially at the high sparsity ratio of $95\%$.
Figure 3: EPSD prunes a randomly initialized network with weights $\theta_{init}$ in Step 1 (blue block) and then employs the SD algorithm to train the pruned network in Step 2 (orange block). In Step 1, EPSD identifies and retains distillable weights by measuring the impact of the SD loss on individual weights after $i$ steps of training.
Figure 4: Trainability analysis. Top: Loss contour plots of early-pruned networks using (a) 'Simple Combination' and (b) EPSD. Bottom: Comparison of Mean-JSV curves of EPSD and the 'Simple Combination' approach.
Figure 5: Comparison of training effort among representative compression approaches. Left: Total training epochs of CPKD [aghli2021combining], PKD [park2021prune], ReKD [chen2021distilling], and DMC [gao2020discrete]. 'PR' and 'KD (SD)' denote pruning and knowledge distillation (self-distillation), respectively. Right: Comparison of total training wall time under identical conditions.
...and 6 more figures
