EPSD: Early Pruning with Self-Distillation for Efficient Model Compression
Dong Chen, Ning Liu, Yichen Zhu, Zhengping Che, Rui Ma, Fachao Zhang, Xiaofeng Mou, Yi Chang, Jian Tang
TL;DR
The paper addresses the inefficiency of combining pruning with knowledge distillation by introducing EPSD, a two-step framework that prunes a freshly initialized network to retain distillable weights and then trains the pruned model with self-distillation (SD). By evaluating the SD loss during pruning, EPSD identifies weights whose removal least disrupts SD guidance, yielding a pruning mask that preserves SD trainability. Across CIFAR-10/100, Tiny-ImageNet, ImageNet, and downstream tasks, EPSD equipped with multiple SD methods consistently outperforms the simple prune-then-distill baseline and competitive pruning/SD approaches, while substantially reducing the training burden (no pre-trained teacher required). The approach demonstrates strong efficiency and robustness on vision benchmarks and suggests potential for extension to larger multi-modal or language models, highlighting a practical route toward deploying efficient compressed models on edge devices.
Abstract
Neural network compression techniques, such as knowledge distillation (KD) and network pruning, have received increasing attention. The recent work "Prune, then Distill" shows that a pruned, student-friendly teacher network can improve the performance of KD. However, the conventional teacher-student pipeline, which entails cumbersome pre-training of the teacher and complicated compression steps, makes pruning with KD less efficient. Besides compressing models, recent compression techniques also emphasize efficiency. Early pruning demands significantly less computation than conventional pruning methods because it does not require a large pre-trained model. Likewise, self-distillation (SD), a special case of KD, is more efficient since it requires no pre-training or student-teacher pair selection. This inspires us to combine early pruning with SD for efficient model compression. In this work, we propose Early Pruning with Self-Distillation (EPSD), a framework that identifies and preserves distillable weights in early pruning for a given SD task. EPSD efficiently combines early pruning and self-distillation in a two-step process, maintaining the pruned network's trainability for compression. Rather than simply combining pruning and SD, EPSD keeps more distillable weights before training so that the pruned network distills better. We demonstrate that EPSD improves the training of pruned networks, supported by visual and quantitative analyses. Our evaluation covers diverse benchmarks (CIFAR-10/100, Tiny-ImageNet, full ImageNet, CUB-200-2011, and Pascal VOC), on which EPSD outperforms advanced pruning and SD techniques.
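
As a rough illustration of the first step described above (pruning an initialized network while taking the SD loss into account), the following minimal sketch scores each weight of a toy linear softmax classifier by how strongly it influences a combined cross-entropy plus SD loss, then keeps the highest-scoring weights. Several pieces are assumptions made only for illustration and are not taken from the paper: the saliency score |w * dL/dw| (a SNIP-style criterion swapped in here), the toy linear model, the label-smoothed soft targets standing in for an SD method's soft targets, and the hypothetical names `sd_loss_grad` and `early_prune_mask`.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def sd_loss_grad(W, X, y_onehot, soft_targets, alpha=1.0):
    """Gradient w.r.t. W of CE(y, p) + alpha * KL(soft_targets || p).

    p = softmax(X @ W). For a softmax output, both loss terms have the
    logit-gradient form (p - target), averaged over the batch.
    """
    p = softmax(X @ W)
    grad_logits = ((p - y_onehot) + alpha * (p - soft_targets)) / X.shape[0]
    return X.T @ grad_logits


def early_prune_mask(W, grad, sparsity):
    """Keep the (1 - sparsity) fraction of weights with largest |W * grad|.

    Assumption: SNIP-style saliency, used here only as a stand-in criterion.
    """
    saliency = np.abs(W * grad)
    n_keep = int(round(W.size * (1.0 - sparsity)))
    keep_idx = np.argsort(saliency, axis=None)[::-1][:n_keep]
    mask = np.zeros(W.size, dtype=bool)
    mask[keep_idx] = True
    return mask.reshape(W.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(32, 8))                 # toy batch: 32 samples, 8 features
    y = np.eye(4)[rng.integers(0, 4, size=32)]   # one-hot labels, 4 classes
    W = rng.normal(scale=0.1, size=(8, 4))       # weights at initialization
    soft = 0.9 * y + 0.1 / 4                     # hypothetical SD soft targets

    mask = early_prune_mask(W, sd_loss_grad(W, X, y, soft), sparsity=0.5)
    W_pruned = W * mask                          # step 2 would train this with SD
    print("kept weights:", int(mask.sum()), "of", W.size)
```

In the second step, the masked network would be trained with the same SD objective while the mask is held fixed; that training loop is omitted here.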
