Teacher-Guided One-Shot Pruning via Context-Aware Knowledge Distillation
Md. Samiul Alim, Sharjil Khan, Amrijit Biswas, Fuad Rahman, Shafin Rahman, Nabeel Mohammed
TL;DR
This work tackles the heavy computational cost of unstructured pruning by introducing a teacher-guided one-shot pruning framework that integrates Knowledge Distillation (KD) into the pruning signal. By computing gradient-based parameter importance from a joint loss $L_{\text{Total}} = \alpha L_{\text{CA-KLD}} + (1-\alpha) L_{\text{CE}}$ and applying a global threshold, the method prunes a large fraction of weights in a single step and then applies sparsity-aware retraining with or without KD. Across CIFAR-10, CIFAR-100, and Tiny ImageNet, the approach achieves high sparsity (up to ~98%) with minimal performance loss and outperforms several state-of-the-art baselines such as EPG, EPSD, and COLT in many regimes, while significantly reducing training latency. The results demonstrate the practical viability of integrating KD directly into pruning decisions, enabling efficient, scalable deployment of sparse networks in resource-constrained environments.
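To make the joint loss concrete, the following is a minimal sketch for a single example, not the authors' implementation. It assumes a plain temperature-scaled KL divergence as a stand-in for $L_{\text{CA-KLD}}$ (the context-aware weighting is not specified in this summary), and the names `softmax`, `joint_loss`, `joint_loss_grad`, and the `temperature` parameter are hypothetical. The gradient with respect to the student logits is the teacher-informed signal that would be backpropagated to score parameters.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def joint_loss(student_logits, teacher_logits, label, alpha, temperature=1.0):
    """alpha * KD + (1 - alpha) * CE for one example.

    Assumption: KD is T^2 * KL(teacher || student) on temperature-softened
    distributions, standing in for the paper's CA-KLD term.
    """
    p = softmax(student_logits)
    ce = -math.log(p[label])
    p_t = softmax(student_logits, temperature)
    q_t = softmax(teacher_logits, temperature)
    kd = temperature ** 2 * sum(q * math.log(q / s) for q, s in zip(q_t, p_t))
    return alpha * kd + (1 - alpha) * ce

def joint_loss_grad(student_logits, teacher_logits, label, alpha, temperature=1.0):
    """Gradient of joint_loss with respect to the student logits."""
    p = softmax(student_logits)
    p_t = softmax(student_logits, temperature)
    q_t = softmax(teacher_logits, temperature)
    grads = []
    for k in range(len(student_logits)):
        ce_grad = p[k] - (1.0 if k == label else 0.0)
        # d/dz_k of T^2 * KL(q_T || p_T) equals T * (p_T[k] - q_T[k]).
        kd_grad = temperature * (p_t[k] - q_t[k])
        grads.append(alpha * kd_grad + (1 - alpha) * ce_grad)
    return grads
```

With `alpha = 0` the signal reduces to the ordinary cross-entropy gradient, and with `alpha = 1` it depends only on the mismatch between student and teacher distributions.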
Abstract
Unstructured pruning remains a powerful strategy for compressing deep neural networks, yet it often demands iterative train-prune-retrain cycles, resulting in significant computational overhead. To address this challenge, we introduce a novel teacher-guided pruning framework that tightly integrates Knowledge Distillation (KD) with importance score estimation. Unlike prior approaches that apply KD as a post-pruning recovery step, our method leverages gradient signals informed by the teacher during importance score calculation to identify and retain parameters most critical for both task performance and knowledge transfer. Our method facilitates a one-shot global pruning strategy that efficiently eliminates redundant weights while preserving essential representations. After pruning, we employ sparsity-aware retraining with and without KD to recover accuracy without reactivating pruned connections. Comprehensive experiments across multiple image classification benchmarks, including CIFAR-10, CIFAR-100, and Tiny ImageNet, demonstrate that our method consistently achieves high sparsity levels with minimal performance degradation. Notably, our approach outperforms state-of-the-art baselines such as EPG and EPSD at high sparsity levels, while offering a more computationally efficient alternative to iterative pruning schemes like COLT. The proposed framework offers a computationally efficient, performance-preserving solution well suited for deployment in resource-constrained environments.
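The one-shot global pruning and sparsity-aware retraining steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's procedure: the importance score is taken to be $|w \cdot g|$ (a common gradient-based saliency; the abstract only says the score is gradient-based), retraining is modeled as a masked SGD update, and the names `global_masks`, `masked_sgd_step`, and the toy layers `conv1` and `fc` are hypothetical.

```python
def global_masks(weights_by_layer, grads_by_layer, sparsity):
    """Build binary keep-masks from one threshold shared by all layers.

    Assumption: importance of a weight is |w * g|, where g is the gradient
    of the joint (teacher-informed) loss with respect to that weight.
    """
    scores = {
        name: [abs(w * g) for w, g in zip(ws, grads_by_layer[name])]
        for name, ws in weights_by_layer.items()
    }
    ranked = sorted(s for layer in scores.values() for s in layer)
    n_prune = int(sparsity * len(ranked))
    if n_prune == 0:
        return {name: [1] * len(layer) for name, layer in scores.items()}
    # Scores at or below the threshold are pruned, so ties may prune
    # slightly more than the requested fraction.
    threshold = ranked[n_prune - 1]
    return {
        name: [1 if s > threshold else 0 for s in layer]
        for name, layer in scores.items()
    }

def masked_sgd_step(weights, grads, mask, lr):
    """One SGD update that keeps pruned weights at exactly zero."""
    return [m * (w - lr * g) for w, g, m in zip(weights, grads, mask)]

# Toy two-layer example with made-up weights and gradients.
weights = {"conv1": [0.9, -0.1, 0.4], "fc": [0.05, -0.7, 0.2, 0.3, -0.6]}
grads = {"conv1": [0.5, 0.2, -0.1], "fc": [0.2, 0.3, -0.4, 0.6, 0.01]}

masks = global_masks(weights, grads, sparsity=0.5)
retrained_fc = masked_sgd_step(weights["fc"], grads["fc"], masks["fc"], lr=0.1)
```

Because the threshold is shared across layers, per-layer sparsity is not uniform: in this toy case two of three `conv1` weights are pruned but only two of five `fc` weights. Multiplying each update by the mask is one simple way to ensure pruned connections are never reactivated during retraining.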
