Teacher-Guided One-Shot Pruning via Context-Aware Knowledge Distillation

Md. Samiul Alim, Sharjil Khan, Amrijit Biswas, Fuad Rahman, Shafin Rahman, Nabeel Mohammed

TL;DR

This work tackles the heavy computational cost of unstructured pruning by introducing a teacher-guided one-shot pruning framework that integrates Knowledge Distillation (KD) into the pruning signal. By computing gradient-based parameter importance from a joint loss $L_{\text{Total}} = \alpha L_{\text{CA-KLD}} + (1-\alpha) L_{\text{CE}}$ and performing global thresholding, the method prunes a large fraction of weights in a single step and then applies sparsity-aware retraining with or without KD. Across CIFAR-10, CIFAR-100, and Tiny ImageNet, the approach achieves high sparsity (up to ~98%) with minimal performance loss and outperforms several state-of-the-art baselines such as EPG, EPSD, and COLT in many regimes, while significantly reducing training latency. The results demonstrate the practical viability of integrating KD directly into pruning decisions, enabling efficient, scalable deployment of sparse networks in resource-constrained environments.
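To make the two ingredients above concrete, the following is a minimal sketch of the joint loss and of one-shot global thresholding. It is not the authors' implementation and rests on several assumptions: a plain temperature-softened KL divergence stands in for $L_{\text{CA-KLD}}$ (the context-aware variant is not defined in this summary), the importance score is taken to be $|w \cdot g|$ (a common first-order saliency, not confirmed by the source), and the names `joint_loss` and `global_prune_masks` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def joint_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=1.0):
    """alpha * KL(teacher || student) + (1 - alpha) * cross-entropy.

    ASSUMPTION: plain softened KL is used here in place of CA-KLD.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kld = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    ce = -math.log(softmax(student_logits)[label])
    return alpha * kld + (1 - alpha) * ce


def global_prune_masks(weights, grads, sparsity):
    """Build 0/1 masks from one threshold shared by all layers.

    weights, grads: dict mapping layer name -> list of floats, where grads
    holds the gradient of the joint loss with respect to each weight.
    ASSUMPTION: importance = |w * g|. Scores tied with the threshold are
    pruned too, so the realised sparsity can slightly exceed the target.
    """
    scores = {
        name: [abs(w * g) for w, g in zip(weights[name], grads[name])]
        for name in weights
    }
    ranked = sorted(s for layer in scores.values() for s in layer)
    n_prune = int(sparsity * len(ranked))
    if n_prune == 0:
        return {name: [1] * len(layer) for name, layer in scores.items()}
    threshold = ranked[n_prune - 1]
    return {
        name: [1 if s > threshold else 0 for s in layer]
        for name, layer in scores.items()
    }


weights = {"a": [1.0, -2.0, 0.5], "b": [3.0, 0.1, -1.0]}
grads = {"a": [0.1, 0.2, 0.0], "b": [0.5, 2.0, 0.3]}
masks = global_prune_masks(weights, grads, sparsity=0.5)
```

Because the threshold is global, layers are not pruned uniformly: in this toy input, layer `a` loses two of its three weights while layer `b` loses only one.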

Abstract

Unstructured pruning remains a powerful strategy for compressing deep neural networks, yet it often demands iterative train-prune-retrain cycles, resulting in significant computational overhead. To address this challenge, we introduce a novel teacher-guided pruning framework that tightly integrates Knowledge Distillation (KD) with importance score estimation. Unlike prior approaches that apply KD as a post-pruning recovery step, our method leverages gradient signals informed by the teacher during importance score calculation to identify and retain parameters most critical for both task performance and knowledge transfer. Our method facilitates a one-shot global pruning strategy that efficiently eliminates redundant weights while preserving essential representations. After pruning, we employ sparsity-aware retraining with and without KD to recover accuracy without reactivating pruned connections. Comprehensive experiments across multiple image classification benchmarks, including CIFAR-10, CIFAR-100, and TinyImageNet, demonstrate that our method consistently achieves high sparsity levels with minimal performance degradation. Notably, our approach outperforms state-of-the-art baselines such as EPG and EPSD at high sparsity levels, while offering a more computationally efficient alternative to iterative pruning schemes like COLT. The proposed framework offers a computation-efficient, performance-preserving solution well suited for deployment in resource-constrained environments.
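The phrase "without reactivating pruned connections" can be illustrated with a masked update step. The sketch below assumes plain SGD and a fixed binary mask produced at pruning time; the function name `masked_sgd_step` and the learning rate are hypothetical, and the paper's actual optimizer and schedule are not given in this summary.

```python
def masked_sgd_step(weights, grads, mask, lr=0.1):
    """One SGD step that keeps pruned weights at exactly zero.

    weights, grads: lists of floats; mask: list of 0/1 fixed after pruning.
    Multiplying the updated value by the mask means a pruned weight stays
    at zero regardless of its gradient.
    """
    return [m * (w - lr * g) for w, g, m in zip(weights, grads, mask)]


# The first weight was pruned (mask 0); it stays at zero despite a nonzero gradient.
updated = masked_sgd_step([0.0, 1.0], [1.0, 1.0], [0, 1], lr=0.5)
```

The same step applies to both retraining regimes described above; only the loss used to produce `grads` differs (task loss alone, or task loss combined with KD).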

Paper Structure

This paper contains 33 sections, 16 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of accuracy comparison between our method and existing approaches across sparsity levels (50%, 75%, 90%, 94%, 98%). In the radar plot, a larger enclosed area indicates higher accuracy; our method consistently achieves higher accuracy on CIFAR-10 than the competing baselines. Detailed results for CIFAR-10 and the other datasets are presented in the Results section.
  • Figure 2: Overview of the proposed teacher-guided one-shot pruning framework. The student is first trained with a combined KD and task loss, then a gradient-based importance score is used to generate a binary pruning mask. The pruned model is retrained under two regimes: (1) standard fine-tuning and (2) KD-guided retraining, to recover performance while maintaining sparsity.
  • Figure 3: Teacher-guided importance score computation.
  • Figure 4: Top-1 accuracy of sparse ResNet-18 on CIFAR-10 at varying sparsity levels (36% to 95%), comparing our method against six baselines: CS-KD Simple [chen2024epsdearlypruningselfdistillation; yun2020regularizing], CS-KD EPSD [chen2024epsdearlypruningselfdistillation], PS-KD Simple [chen2024epsdearlypruningselfdistillation; kim2021paraphrasing], PS-KD EPSD [chen2024epsdearlypruningselfdistillation], DLB Simple [chen2024epsdearlypruningselfdistillation; shen2022dynamic], and DLB EPSD [chen2024epsdearlypruningselfdistillation]. Our method consistently outperforms all baselines at high and moderate sparsity levels.
  • Figure 5: Top-1 accuracy comparison of ResNet-18 on CIFAR-100 across five sparsity levels (36% to 95%). Results are benchmarked against six baselines: CS-KD Simple [chen2024epsdearlypruningselfdistillation; yun2020regularizing], CS-KD EPSD [chen2024epsdearlypruningselfdistillation], PS-KD Simple [chen2024epsdearlypruningselfdistillation; kim2021paraphrasing], PS-KD EPSD [chen2024epsdearlypruningselfdistillation], DLB Simple [chen2024epsdearlypruningselfdistillation; shen2022dynamic], and DLB EPSD [chen2024epsdearlypruningselfdistillation]. Our method achieves consistently higher accuracy, especially at moderate sparsity levels.
  • ...and 1 more figure