A Teacher-Free Graph Knowledge Distillation Framework with Dual Self-Distillation

Lirong Wu; Haitao Lin; Zhangyang Gao; Guojiang Zhao; Stan Z. Li

TL;DR

The paper addresses the inference-latency gap in graph modeling by proposing a Teacher-Free Graph Self-Distillation (TGS) framework, in which a pure MLP backbone learns topology-aware representations during training and needs neither a teacher model nor a GNN in training or inference. TGS employs dual self-distillation (feature-level from neighbors to the target, and label-level from the target to neighbors), augmented by mixup-style neighbor interpolation and batch-edge sampling to manage memory and computation. Empirically, TGS improves over vanilla MLPs by 15.54% on average and, on six real-world datasets, achieves competitive or superior results relative to state-of-the-art GKD methods. At inference it runs 75X-89X faster than existing GNNs and 16X-25X faster than classical inference acceleration methods. The approach is also reported to be robust to limited and noisy labels and to scale well with graph size, which suggests a practical path toward topology-aware graph reasoning that is fast and free from data dependency at deployment.
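The dual self-distillation idea summarized above can be illustrated with a minimal sketch. This is not the authors' implementation: the exact loss definitions live in the paper's Methodology section, which is not reproduced here. The sketch assumes a KL term for the feature-level direction (neighbors to target), a cross-entropy term for the label-level direction (target label to neighbors), a Beta(1, 1) mixup coefficient, and two linear heads `W_f` and `W_g` applied directly to raw features in place of the MLP backbone. All function and parameter names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_div(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def linear(W, x):
    """Apply a weight matrix (list of rows) to a feature vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def mixup(x_i, x_j, beta):
    """Interpolate target features x_i with neighbor features x_j."""
    return [beta * a + (1.0 - beta) * b for a, b in zip(x_i, x_j)]


def sample_edge_batch(edges, batch_size, rng):
    """Batch-edge sampling: draw at most batch_size edges per step."""
    return rng.sample(edges, min(batch_size, len(edges)))


def dual_self_distillation_loss(X, edges, labels, W_f, W_g, batch_size, alpha, rng):
    """Assumed form: mean of (feature-level KL + alpha * label-level CE)."""
    batch = sample_edge_batch(edges, batch_size, rng)
    feat_loss, label_loss = 0.0, 0.0
    for i, j in batch:
        beta = rng.betavariate(1.0, 1.0)  # assumed mixup prior
        p_target = softmax(linear(W_f, X[i]))
        p_neighbor = softmax(linear(W_g, mixup(X[i], X[j], beta)))
        # Feature-level: knowledge flows from the neighbor to the target.
        feat_loss += kl_div(p_neighbor, p_target)
        # Label-level: the target's ground-truth label supervises the neighbor.
        if i in labels:
            label_loss += -math.log(p_neighbor[labels[i]] + 1e-12)
    return (feat_loss + alpha * label_loss) / max(len(batch), 1)


# Toy graph: 3 nodes, 2 features, 2 classes, one labeled node.
rng = random.Random(0)
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [(0, 1), (1, 2), (0, 2)]
W = [[0.5, -0.5], [-0.5, 0.5]]
print(dual_self_distillation_loss(X, edges, {0: 0}, W, W, 2, 0.5, rng))
```

Because the graph appears only through the sampled edge list inside the loss, nothing in the forward pass of either head depends on neighbors, which is what lets such a model drop the graph entirely at inference time.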

Abstract

Recent years have witnessed great success in handling graph-related tasks with Graph Neural Networks (GNNs). Despite the great academic success of GNNs, Multi-Layer Perceptrons (MLPs) remain the primary workhorse for practical industrial applications. One reason for this academic-industry gap is the neighborhood-fetching latency incurred by data dependency in GNNs. To narrow the gap, Graph Knowledge Distillation (GKD) has been proposed, usually based on a standard teacher-student architecture, to distill knowledge from a large teacher GNN into a lightweight student GNN or MLP. However, we find in this paper that neither teachers nor GNNs are necessary for graph knowledge distillation. We propose a Teacher-Free Graph Self-Distillation (TGS) framework that does not require any teacher model or GNN in either training or inference. More importantly, the proposed TGS framework is based purely on MLPs, where structural information is used only implicitly to guide dual knowledge self-distillation between the target node and its neighborhood. As a result, TGS enjoys the benefits of graph topology awareness in training but is free from data dependency in inference. Extensive experiments show that the performance of vanilla MLPs can be greatly improved with dual self-distillation, e.g., TGS improves over vanilla MLPs by 15.54% on average and outperforms state-of-the-art GKD algorithms on six real-world datasets. In terms of inference speed, TGS infers 75X-89X faster than existing GNNs and 16X-25X faster than classical inference acceleration methods.
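The data dependency mentioned in the abstract can be made concrete with a small sketch. It is an illustration under stated assumptions rather than code from the paper: `mlp_predict` and `gnn_predict` are hypothetical names, the classifier is a single linear layer, and the GNN is reduced to one mean-aggregation step over the node and its 1-hop neighbors.

```python
def mlp_predict(W, x):
    """Classify a node from its own features only (no graph lookup)."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in W]
    return max(range(len(scores)), key=scores.__getitem__)


def gnn_predict(W, node, X, neighbors):
    """Classify a node after fetching and averaging neighbor features."""
    # This fetch is the data dependency: it needs the graph at inference time.
    group = [X[node]] + [X[j] for j in neighbors[node]]
    mean = [sum(col) / len(group) for col in zip(*group)]
    return mlp_predict(W, mean)


# Toy example: two connected nodes, identity weights, two classes.
X = [[1.0, 0.0], [0.0, 5.0]]
neighbors = {0: [1], 1: [0]}
W = [[1.0, 0.0], [0.0, 1.0]]

print(mlp_predict(W, X[0]))                # uses X[0] alone
print(gnn_predict(W, 0, X, neighbors))     # also reads X[1]
```

In the MLP path the prediction for a node is a function of that node's feature vector alone, so no neighborhood has to be fetched at serving time; in the GNN path every prediction first reads its neighbors' features.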

Paper Structure (33 sections, 9 equations, 5 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Graph Neural Networks
Graph Knowledge Distillation
Graph Contrastive Learning (GCL)
Preliminaries
Methodology
Backbone Architecture
Dual Knowledge Self-Distillation
Feature-level Self-Distillation
Label-level Self-Distillation
Training and Inferring
Model Training
Model Inferring
Discussion and Comparison
...and 18 more sections

Figures (5)

Figure 1: (a): Illustration of four different types of graph knowledge distillation algorithms, depending on whether teacher models and GNNs/MLPs are included in training and inference. (b): Inference accuracy (%) vs. inference time (ms) on the Cora dataset. If not specifically mentioned, all GKD algorithms adopt GCN as the backbone by default.
Figure 2: Illustration of the proposed Teacher-Free Graph Self-Distillation (TGS) framework. In the training stage, the MLP and two inference layers $f_\theta(\cdot)$, $g_\gamma(\cdot)$ are jointly trained by the proposed dual feature-level and label-level self-distillation.
Figure 3: (a)(b) Classification accuracy (%) under different label noise ratios on the Cora and Citeseer datasets, respectively. (c) Inference time (ms) with different layers on the Coauthor-CS dataset.
Figure 4: (a) Ablation study on four key model components. (b) Learning curves of MLPs and TGS on the Cora dataset, showing that self-distillation helps to regularize the training. The logarithmized vertical coordinate is the cross-entropy loss between the predicted and ground-truth labels on the training or validation set, respectively. (c) Mean cosine similarity curves of MLPs, GCNs, and TGS between the target node with 1-hop and 2-hop neighbors on the Cora dataset.
Figure 5: Hyperparameter sensitivity analysis on the batch size $B$ (Left) and trade-off weight $\alpha$ (Right) on four datasets.
