FlyKD: Graph Knowledge Distillation on the Fly with Curriculum Learning

Eugene Ku

TL;DR

FlyKD (Knowledge Distillation on the Fly) enables the generation of a virtually unlimited number of pseudo labels and couples this with Curriculum Learning, which greatly eases optimization over the noisy pseudo labels.

Abstract

Knowledge Distillation (KD) aims to transfer the knowledge of a more capable teacher model to a lighter student model, making the resulting model faster and easier to deploy. However, optimizing the student model over the noisy pseudo labels generated by the teacher is difficult, and the number of pseudo labels one can generate is limited by Out of Memory (OOM) errors. In this paper, we propose FlyKD (Knowledge Distillation on the Fly), which enables the generation of a virtually unlimited number of pseudo labels, coupled with Curriculum Learning, which greatly eases optimization over the noisy pseudo labels. Empirically, we observe that FlyKD outperforms vanilla KD and the renowned Local Structure Preserving Graph Convolutional Network (LSPGCN). Lastly, given the success of Curriculum Learning, we shed light on a new research direction: improving optimization over noisy pseudo labels.
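To make the memory argument concrete, the following is a minimal sketch of producing pseudo labels per training step instead of materialising them all up front. It is not the paper's implementation: the function names, the uniform sampling of node pairs, and the toy teacher are hypothetical placeholders, whereas the paper samples from a degree-aware random graph.

```python
import random
from typing import Callable, Iterator, List, Tuple

Edge = Tuple[int, int]


def pseudo_label_stream(
    num_nodes: int,
    batch_size: int,
    teacher_score: Callable[[int, int], float],
    seed: int = 0,
) -> Iterator[List[Tuple[Edge, float]]]:
    """Yield fresh batches of (edge, teacher score) pairs indefinitely.

    Only one batch is held in memory at a time, so the number of pseudo
    labels seen during training is limited by training time, not memory.
    """
    rng = random.Random(seed)
    while True:
        batch = []
        for _ in range(batch_size):
            # Assumption: uniform node sampling, used here only for brevity.
            u = rng.randrange(num_nodes)
            v = rng.randrange(num_nodes)
            batch.append(((u, v), teacher_score(u, v)))
        yield batch


def toy_teacher(u: int, v: int) -> float:
    # Hypothetical stand-in for a trained teacher model's prediction.
    return 1.0 / (1.0 + abs(u - v))
```

A training loop would call `next(stream)` once per step and discard the batch after the student update, so no pseudo label has to be stored across steps.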

Paper Structure (14 sections, 5 equations, 3 figures, 3 tables, 2 algorithms)

Introduction
Related Works
Graph Neural Networks
Knowledge Distillation on Graphs
PrimeKG and TxGNN
Methods
FlyKD
Curriculum Learning
Results
Discussion
Conclusion
Training Details
Additional Experiments
Acknowledgements

Figures (3)

Figure 1: Illustrative representation of the PrimeKG dataset. PrimeKG is a collaborative dataset, scraped from 20+ high-quality databases. Adapted from [PrimeKG].
Figure 2: Three types of labels in FlyKD. Note that without the pseudo labels on the Degree-aware Random Graph (Green), FlyKD is equivalent to vanilla Knowledge Distillation.
Figure 3: Plot of the loss scheduler that incorporates Curriculum Learning. The colors of the loss schedules match those of Figure 2. Notice that the weight of the original loss does not go all the way to 0 but stops at 0.05 to avoid catastrophic forgetting.
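
The scheduling idea in the Figure 3 caption can be sketched as follows. This is an illustrative assumption, not the paper's scheduler: the linear decay, the starting weight of 1.0, the function names, and the way the remaining weight is handed to a single pseudo-label loss are all hypothetical. Only the 0.05 floor on the original loss comes from the caption.

```python
def original_loss_weight(
    step: int,
    total_steps: int,
    start: float = 1.0,   # assumed starting weight
    floor: float = 0.05,  # floor stated in the Figure 3 caption
) -> float:
    """Linearly decay the original-label loss weight from `start` to `floor`."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so steps past the end stay at the floor.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (floor - start) * progress


def combined_loss(
    original_loss: float,
    pseudo_loss: float,
    step: int,
    total_steps: int,
) -> float:
    """Blend the two losses; the pseudo-label loss takes the remaining weight."""
    w = original_loss_weight(step, total_steps)
    return w * original_loss + (1.0 - w) * pseudo_loss
```

Keeping the floor above zero means the ground-truth labels still contribute a small gradient late in training, which is the stated guard against catastrophic forgetting.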
