HPM-KD: Hierarchical Progressive Multi-Teacher Framework for Knowledge Distillation and Efficient Model Compression

Gustavo Coelho Haase, Paulo Henrique Dourado da Silva

TL;DR

HPM-KD introduces a modular, six-component framework for hierarchical progressive multi-teacher knowledge distillation to address hyperparameter sensitivity, capacity gaps, and inefficient resource use in model compression. By integrating an adaptive configuration manager, automatic progressive chains, attention-weighted multi-teacher ensembling, a meta-learned temperature scheduler, parallel processing, and shared optimization memory, the approach achieves substantial compression with strong retention and notable runtime improvements across CIFAR-10/100 and tabular datasets. Empirical results show the framework outperforms several baselines while maintaining modest overhead, with ablation analyses highlighting the critical role of the multi-teacher component and synergy among modules. The work provides a production-ready, open-source solution that enables automated, scalable distillation, though it also acknowledges cases where direct training or simpler baselines can be preferable depending on compression level and model capacity.

Abstract

Knowledge Distillation (KD) has emerged as a promising technique for model compression but faces critical limitations: (1) sensitivity to hyperparameters, which requires extensive manual tuning, (2) a capacity gap when distilling from very large teachers to small students, (3) suboptimal coordination in multi-teacher scenarios, and (4) inefficient use of computational resources. We present HPM-KD, a framework that integrates six synergistic components: (i) an Adaptive Configuration Manager that uses meta-learning to eliminate manual hyperparameter tuning, (ii) a Progressive Distillation Chain with automatically determined intermediate models, (iii) an Attention-Weighted Multi-Teacher Ensemble that learns dynamic per-sample weights, (iv) a Meta-Learned Temperature Scheduler that adapts the temperature throughout training, (v) a Parallel Processing Pipeline with intelligent load balancing, and (vi) a Shared Optimization Memory for cross-experiment reuse. Experiments on CIFAR-10, CIFAR-100, and tabular datasets demonstrate that HPM-KD achieves 10x-15x compression while maintaining 85% accuracy retention, eliminates the need for manual tuning, and reduces training time by 30-40% via parallelization. Ablation studies confirm the independent contribution of each component (0.10-0.98 pp). HPM-KD is available as part of the open-source DeepBridge library.
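
To make components (iii) and (iv) more concrete, the following is a minimal sketch of a temperature-scaled distillation loss computed against a weighted mixture of teacher outputs. It is an illustration under stated assumptions, not the HPM-KD or DeepBridge implementation: the function names (`softmax`, `ensemble_soft_targets`, `kd_loss`) are hypothetical, the per-sample attention scores are passed in as plain numbers rather than learned, the temperature is a fixed argument rather than the output of a meta-learned scheduler, and the loss is the standard KL divergence with the usual T^2 scaling.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (numerically stable)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_soft_targets(teacher_logits, attention_scores, temperature):
    """Mix the teachers' softened distributions for one sample.

    teacher_logits:   one list of class logits per teacher.
    attention_scores: one raw score per teacher for this sample. In this
                      sketch they are supplied by the caller (assumption);
                      a real system would learn them.
    """
    weights = softmax(attention_scores)  # normalize scores to sum to 1
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    n_classes = len(probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs))
        for c in range(n_classes)
    ]


def kd_loss(student_logits, soft_targets, temperature):
    """KL(soft_targets || student) at the given temperature, scaled by T^2."""
    student_probs = softmax(student_logits, temperature)
    kl = sum(
        t * math.log(t / s)
        for t, s in zip(soft_targets, student_probs)
        if t > 0.0
    )
    return temperature ** 2 * kl


if __name__ == "__main__":
    teachers = [[2.0, 1.0, 0.1], [1.5, 1.2, 0.3]]  # two hypothetical teachers
    scores = [0.7, 0.2]                             # hypothetical attention scores
    targets = ensemble_soft_targets(teachers, scores, temperature=4.0)
    print("soft targets:", targets)
    print("KD loss:", kd_loss([0.5, 0.4, 0.1], targets, temperature=4.0))
```

In this sketch a higher attention score shifts the soft targets toward that teacher's distribution, and raising the temperature flattens every distribution before mixing. A scheduler in the spirit of component (iv) would vary the `temperature` argument over the course of training.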

Paper Structure

This paper contains 96 sections, 8 equations, 8 tables, 1 algorithm.