Right Time to Learn: Promoting Generalization via Bio-inspired Spacing Effect in Knowledge Distillation

Guanglong Sun, Hongwei Yan, Liyuan Wang, Qian Li, Bo Lei, Yi Zhong

TL;DR

Right Time to Learn introduces Spaced KD, a bio-inspired, temporally spaced distillation strategy that advances the teacher several steps ahead of the student. The authors show theoretically that spacing leads to a flatter loss landscape, evidenced by reduced Hessian trace, and provide extensive empirical validation across CNN and ViT backbones on CIFAR, Tiny-ImageNet, and ImageNet-1K, with consistent gains over online KD and self KD. Key contributions include a formal spacing mechanism with an interval parameter $s$, a Hessian-based analysis, and demonstration of Spaced KD’s generality across loss functions and KD variants. The work suggests a practical, plug-in approach to improve KD generalization without extra training cost, and offers insights into the temporal dynamics of knowledge transfer that could inform neuroscience-inspired learning frameworks.
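To make "flatter loss landscape, evidenced by reduced Hessian trace" concrete, here is a minimal sketch of how a Hessian trace can be estimated. This summary does not say how the authors measure it, so the sketch assumes Hutchinson's estimator with finite-difference Hessian-vector products, a common choice. The toy quadratic losses and all names (`hutchinson_trace`, `quadratic_grad`) are hypothetical and exist only for illustration.

```python
import random
from typing import Callable, List

Vector = List[float]


def hutchinson_trace(
    grad_fn: Callable[[Vector], Vector],
    w: Vector,
    num_samples: int = 16,
    eps: float = 1e-3,
    seed: int = 0,
) -> float:
    """Estimate tr(H) at w as the average of v^T H v over random sign vectors v.

    H v is approximated by a central difference of gradients:
    H v ~= (g(w + eps * v) - g(w - eps * v)) / (2 * eps).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        # Rademacher probe: each entry is +1 or -1 with equal probability.
        v = [rng.choice((-1.0, 1.0)) for _ in w]
        g_plus = grad_fn([wi + eps * vi for wi, vi in zip(w, v)])
        g_minus = grad_fn([wi - eps * vi for wi, vi in zip(w, v)])
        hv = [(gp - gm) / (2.0 * eps) for gp, gm in zip(g_plus, g_minus)]
        total += sum(vi * hi for vi, hi in zip(v, hv))
    return total / num_samples


def quadratic_grad(curvatures: Vector) -> Callable[[Vector], Vector]:
    """Gradient of the toy loss L(w) = 0.5 * sum_i a_i * w_i^2 (Hessian = diag(a))."""
    return lambda w: [a * wi for a, wi in zip(curvatures, w)]


# Toy stand-ins for a "sharp" and a "flat" minimum (assumed, not from the paper).
sharp_trace = hutchinson_trace(quadratic_grad([10.0, 10.0]), [0.5, -0.5])
flat_trace = hutchinson_trace(quadratic_grad([1.0, 1.0]), [0.5, -0.5])
print(f"sharp: {sharp_trace:.3f}, flat: {flat_trace:.3f}")
```

A smaller trace means lower total curvature around the solution, which is the sense in which a reduced Hessian trace is read as a flatter landscape. For a real network, `grad_fn` would be replaced by a gradient computed with automatic differentiation.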

Abstract

Knowledge distillation (KD) is a powerful strategy for training deep neural networks (DNNs). Although it was originally proposed to train a more compact "student" model from a large "teacher" model, many recent efforts have focused on adapting it to promote generalization of the model itself, such as online KD and self KD. Here, we propose an accessible and compatible strategy named Spaced KD to improve the effectiveness of both online KD and self KD, in which the student model distills knowledge from a teacher model trained a space interval ahead. This strategy is inspired by a prominent theory in biological learning and memory known as the spacing effect, which posits that appropriate intervals between learning trials can significantly enhance learning performance. With both theoretical and empirical analyses, we demonstrate that the benefits of the proposed Spaced KD stem from convergence to a flatter loss landscape during stochastic gradient descent (SGD). We perform extensive experiments to validate the effectiveness of Spaced KD in improving the learning performance of DNNs (e.g., the performance gain is up to 2.31% and 3.34% on Tiny-ImageNet over online KD and self KD, respectively). Our code has been released on GitHub: https://github.com/SunGL001/Spaced-KD.
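The spacing idea described in the abstract can be sketched as an update schedule: the teacher takes a block of steps first, and the student then distills from that more advanced teacher. The sketch below is an illustration under stated assumptions, not the authors' released implementation. It assumes the interval is counted in whole batches (the paper's $s$ may be a fraction of an epoch), that the student revisits the same batches the teacher just saw, and that distillation uses the standard temperature-scaled KL loss. The names `spaced_kd_schedule` and `kd_loss` are hypothetical.

```python
import math
from typing import Iterator, List, Tuple


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    temperature: float = 4.0,
) -> float:
    """Standard KD loss: T^2 * KL(teacher soft targets || student soft outputs)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


def spaced_kd_schedule(num_batches: int, interval: int) -> Iterator[Tuple[str, int]]:
    """Yield (role, batch_index) update events.

    Assumed schedule: the teacher is updated on `interval` consecutive batches,
    then the student is updated on those same batches while distilling from the
    teacher as it stands after the block.
    """
    if interval < 1:
        raise ValueError("interval must be at least 1")
    for start in range(0, num_batches, interval):
        block = range(start, min(start + interval, num_batches))
        for i in block:
            yield ("teacher", i)
        for i in block:
            yield ("student", i)


events = list(spaced_kd_schedule(num_batches=5, interval=2))
for role, batch_index in events:
    print(role, batch_index)
```

With `interval=1` the events alternate teacher and student on every batch, which resembles the step-by-step coupling of online KD; larger intervals let the teacher run further ahead before each round of distillation.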

Paper Structure

This paper contains 46 sections, 2 theorems, 11 equations, 6 figures, 13 tables, 4 algorithms.

Key Result

Lemma 4.3

$u_{k(t)} \leq u_t$.

Figures (6)

  • Figure 1: Diagram of Spaced KD. In online KD, the teacher and student are two individual networks. In self KD, we follow prior work (self-kd) that distills knowledge from the deepest layer to the shallower layers of the same network. In Spaced KD, we train the teacher a controllable space interval ahead and then distill its knowledge to the student network.
  • Figure 2: Alignment of spaced learning in biological neural networks (BNNs) and DNNs. (a) Computational cognitive model of spaced learning, modified from (landauer1969reinforcement). (b) Overall performance of Spaced KD across different networks and benchmarks. R18: ResNet-18; R50: ResNet-50; R101: ResNet-101; C100: CIFAR-100; T200: Tiny-ImageNet. (c) Quadratic polynomial fit to all results in (b).
  • Figure 3: Impact of the time at which Spaced KD ($s=1.5$) is initiated, when it is applied (a) for a constant 10 training epochs or (b) until the end of training.
  • Figure 4: Impact of Gaussian noise on performance.
  • Figure 5: Hyperparameter validation for Spaced KD. Accuracy for different learning rates (a) and batch sizes (b) of gradient intervals.
  • ...and 1 more figure

Theorems & Definitions (6)

  • Definition 4.1: Local linearization.
  • Definition 4.2: Teacher model gap
  • Lemma 4.3: Lower risk of spaced teacher
  • proof
  • Theorem 4.4
  • proof