Education distillation: getting student models to learn in schools

Ling Feng; Tianhao Wu; Xiangrong Ren; Zhi Jing; Xuliang Duan

TL;DR

The work tackles the challenge of efficient knowledge transfer and catastrophic forgetting in convolutional neural networks during knowledge distillation. It proposes Education Distillation (ED), a progressive framework that splits the student into a main body and three Teaching Reference (TR) blocks, training on partitioned sub-datasets and adding TR blocks at staged epochs to rebuild the full model. The formalization combines a distillation loss with standard supervision across stages, yielding an overall ED loss that is approximately the sum of stage-wise losses, with non-overlapping feature partitions to enhance learning efficiency. Empirically, ED outperforms both single-teacher and multi-teacher KD across diverse architectures and datasets (CIFAR100, Tiny ImageNet, Caltech256, Food-101), demonstrating improved accuracy and generalization. The approach offers a practical path toward lifelong learning in vision tasks, accompanied by a public codebase for replication and extension.
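To make the loss structure described above concrete, here is a minimal sketch in plain Python. It assumes the common form of knowledge distillation loss (temperature-softened KL divergence between teacher and student, mixed with cross-entropy on the true label); the summary does not give the paper's exact formula, so the weighting `alpha`, the `temperature` value, and the function names `stage_loss` and `ed_loss` are illustrative assumptions, not the authors' definitions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def stage_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=4.0):
    """Loss for one sample in one stage (assumed form, not taken from the paper).

    alpha weights the supervised cross-entropy term; (1 - alpha) weights the
    distillation term, scaled by temperature**2 as is conventional.
    """
    # Supervised term: cross-entropy against the ground-truth label.
    ce = -math.log(softmax(student_logits)[label])
    # Distillation term: KL(teacher || student) on softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return alpha * ce + (1 - alpha) * temperature ** 2 * kl


def ed_loss(stages):
    """Overall loss as the sum of stage-wise losses.

    `stages` is a list of (student_logits, teacher_logits, label) tuples,
    one per stage.
    """
    return sum(stage_loss(s, t, y) for s, t, y in stages)
```

When the student's logits match the teacher's, the distillation term vanishes and only the weighted cross-entropy remains, which is a quick sanity check on any implementation of this form.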

Abstract

This paper introduces a new knowledge distillation method, education distillation (ED), inspired by the structured and progressive nature of human learning. ED mimics the educational stages of primary school, middle school, and university through purpose-built teaching reference blocks. The student model is split into a main body and multiple teaching reference blocks so that it learns from teachers step by step. This promotes efficient knowledge distillation while preserving the architecture of the student model. Experimental results on the CIFAR100, Tiny ImageNet, Caltech256, and Food-101 datasets show that the teaching reference blocks effectively mitigate catastrophic forgetting. Compared with conventional single-teacher and multi-teacher knowledge distillation methods, ED significantly improves the accuracy and generalization ability of the student model. These findings highlight the potential of ED to improve model performance across different architectures and datasets. Code is available at: https://github.com/Revolutioner1/ED.git.

Paper Structure

This paper contains 7 sections, 10 equations, 1 figure, 8 tables.

Introduction
Method
Education Distillation Method
The Phenomenon of Catastrophic Forgetting
Formalization
Experiments
Conclusion

Figures (1)

Figure 1: In education distillation (ED), the student model is divided into a main body and three teaching reference blocks, and the dataset is partitioned into three sub-datasets for distillation. The model starts at the "primary school" stage, where it trains on sub-dataset 1. Once the $n$-th epoch is reached and the model has fitted, the second teaching reference block is added and the model progresses to the "secondary school" stage, training on sub-dataset 2. Once the $m$-th epoch is reached and the model has fitted, the final teaching reference block is added and the model reaches the "university" stage, training on sub-dataset 3.
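The staged schedule in the caption can be sketched as a small lookup function. This is an illustration only: the function name `ed_stage`, the 0-based epoch counting, and the choice to switch stages exactly at epochs `n` and `m` are assumptions, and the sketch leaves out the caption's second condition that fitting has been achieved before a block is added.

```python
def ed_stage(epoch, n, m):
    """Return (stage_name, active_tr_blocks, sub_dataset) for a given epoch.

    n and m are the epochs at which the second and third teaching reference
    (TR) blocks are added. The convergence check mentioned in the caption is
    not modeled here; stages switch on epoch count alone.
    """
    if not 0 < n < m:
        raise ValueError("expected 0 < n < m")
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    if epoch < n:
        # Main body plus the first TR block, trained on sub-dataset 1.
        return ("primary school", 1, 1)
    if epoch < m:
        # Second TR block added, trained on sub-dataset 2.
        return ("secondary school", 2, 2)
    # Final TR block added, full student model, trained on sub-dataset 3.
    return ("university", 3, 3)
```

A training loop would call `ed_stage(epoch, n, m)` at the start of each epoch to decide how many blocks to attach and which sub-dataset to load.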