Teaching MLP More Graph Information: A Three-stage Multitask Knowledge Distillation Framework

Junxian Li; Bin Shi; Erfei Cui; Hua Wei; Qinghua Zheng


TL;DR

This is the first work to include hidden-layer distillation for a student MLP on graphs and to combine graph Positional Encoding with an MLP.

Abstract

We study a key obstacle to Graph Neural Network (GNN) inference on large-scale graph datasets: its heavy time and memory consumption. We try to overcome it by reducing reliance on graph structure. Although distilling graph knowledge into a student MLP is a promising idea, it faces two major problems: loss of positional information and low generalization. To address them, we propose a new three-stage multitask distillation framework. Specifically, we use Positional Encoding to capture positional information. We also introduce Neural Heat Kernels, which are responsible for graph data processing in the GNN, and match hidden-layer outputs to improve the student MLP's hidden layers. To the best of our knowledge, this is the first work to include hidden-layer distillation for a student MLP on graphs and to combine graph Positional Encoding with an MLP. We test its performance and robustness under several settings and conclude that our method performs well with good stability.
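To make the multitask idea concrete, here is a minimal sketch of how a soft-label term and a hidden-layer matching term might be combined for a single node. It is not the paper's loss: the function names, the `hidden_weight` and `temperature` parameters, the use of mean squared error for hidden-layer matching, and the concatenation of a positional encoding onto the node features are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def hidden_matching_loss(teacher_hidden, student_hidden):
    """Mean squared error between hidden vectors (assumed matching criterion)."""
    n = len(teacher_hidden)
    return sum((t - s) ** 2 for t, s in zip(teacher_hidden, student_hidden)) / n


def multitask_distillation_loss(teacher_logits, student_logits,
                                teacher_hidden, student_hidden,
                                hidden_weight=0.5, temperature=1.0):
    """Soft-label KL term plus a weighted hidden-layer term.

    hidden_weight and temperature are hypothetical hyperparameters.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft_term = kl_divergence(p_teacher, p_student)
    hidden_term = hidden_matching_loss(teacher_hidden, student_hidden)
    return soft_term + hidden_weight * hidden_term


def student_input(features, positional_encoding):
    """Assumed way to feed position to the MLP: concatenate it to the features."""
    return list(features) + list(positional_encoding)


if __name__ == "__main__":
    loss = multitask_distillation_loss(
        teacher_logits=[2.0, 0.5, -1.0],
        student_logits=[1.0, 1.0, 0.0],
        teacher_hidden=[0.2, -0.4],
        student_hidden=[0.0, 0.1],
    )
    print("loss:", loss)
    print("input:", student_input([0.3, 0.7], [0.1, -0.1]))
```

Under these assumptions the loss is zero when the student reproduces both the teacher's logits and its hidden vector, and grows as either diverges.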


Paper Structure (31 sections, 1 theorem, 15 equations, 4 figures, 10 tables)


Introduction
Related Work
Preliminaries
Motivation
NHK Based Distillation
Positional Encoding for global distillation
KL-Divergence
Methodology
Three-Stage distillation
Distillation loss
Usage of trainable reverse kernel
Discussion
Experiments
Dataset
Baselines
...and 16 more sections

Key Result

Theorem (Effectiveness of Spectral Clustering). Suppose that the whole graph consists of multiple partitions that are internally similar and differ significantly from one another. We therefore need to find such optimal partitions. Given a partition of the graph into k sets, we can define k indicator vectors. It can be proven that clustering can provide structural information by providing similarities and E
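The indicator vectors mentioned in the theorem can be illustrated with a small sketch. It does not reproduce the paper's proof. It only shows the standard identity that, for an unweighted graph with Laplacian L = D - A and the 0/1 indicator vector h of a node set, the quadratic form h^T L h counts the edges leaving that set. The toy graph (two triangles joined by one edge) and all function names are assumptions chosen for illustration.

```python
def adjacency_from_edges(n, edges):
    """Symmetric 0/1 adjacency matrix of an undirected graph without self-loops."""
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = 1
        adj[v][u] = 1
    return adj


def laplacian(adj):
    """Unnormalized graph Laplacian L = D - A."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0) - adj[i][j] for j in range(n)]
            for i in range(n)]


def indicator_vectors(partition, n):
    """One 0/1 vector per part: entry v is 1 if node v belongs to that part."""
    return [[1 if v in part else 0 for v in range(n)] for part in partition]


def quadratic_form(lap, h):
    """Compute h^T L h, which equals the number of edges cut by the part h."""
    n = len(h)
    return sum(h[i] * lap[i][j] * h[j] for i in range(n) for j in range(n))


if __name__ == "__main__":
    # Two triangles (0,1,2) and (3,4,5) joined by the single edge 2-3.
    edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    lap = laplacian(adjacency_from_edges(6, edges))

    good = indicator_vectors([{0, 1, 2}, {3, 4, 5}], 6)
    bad = indicator_vectors([{0, 1}, {2, 3, 4, 5}], 6)

    print([quadratic_form(lap, h) for h in good])  # cut size per part
    print([quadratic_form(lap, h) for h in bad])
```

The partition that follows the two triangles cuts one edge, while the alternative cuts two, so a smaller quadratic form corresponds to parts that are more tightly connected internally.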

Figures (4)

Figure 1: Overall structure of our framework. It consists of three stages: GNN pretraining, distillation and inference. We transfer knowledge to student MLP during stage II. Best viewed in color.
Figure 2: Results of the robustness study with different teachers on three datasets. The x-axis shows the percentage of feature noise and the y-axis shows test accuracy (%).
Figure 3: Results of the sensitivity study with different teachers on three datasets. Mean and variance of test accuracy are reported. We study whether changes in $\gamma$ greatly influence the student MLP's performance. The x-axis shows different values of $\gamma$ and the y-axis shows the corresponding test accuracy (%).
Figure 4: Results of the trainable kernel on three benchmark datasets, compared to kernels without self-training.

Theorems & Definitions (2)

Definition
Theorem
