Efficient Graph Knowledge Distillation from GNNs to Kolmogorov-Arnold Networks via Self-Attention Dynamic Sampling

Can Cui, Zilong Fu, Penghe Huang, Yuanyuan Li, Wu Deng, Dongyan Li

TL;DR

This work tackles the challenge of deploying graph models on resource-constrained devices by distilling GNN knowledge into a Fourier-based Kolmogorov–Arnold Network (KAN) using the proposed SA-DSD framework. SA-DSD employs self-attention to dynamically identify informative nodes and reweight distillation signals, compensating for the lack of explicit neighborhood aggregation in KANs. The FR-KAN+ student, with learnable frequency bases, complex weights, and phase shifts, achieves substantial compression (≈16.69× fewer parameters) and a runtime reduction (≈55.75% less average runtime per epoch) relative to key benchmarks, while improving predictive accuracy by up to 3.62% over GNN teachers and by 15.61% over FR-KAN+ without SA-DSD. Experiments on six real-world datasets under both inductive and transductive settings demonstrate strong, architecture-aware knowledge transfer and practical edge deployment potential.
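To make the idea of attention-based reweighting concrete, the following is a minimal sketch, not the authors' implementation. It assumes plain scaled dot-product self-attention over raw node features (no learned query/key projections), takes the average attention a node receives as its importance, and uses that importance to weight a per-node KL divergence between teacher and student predictions. The function names `node_importance` and `weighted_kd_loss` and the temperature parameter are hypothetical choices for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def node_importance(feats):
    """Average attention each node receives (assumed importance measure).

    Uses scaled dot-product attention with the raw features acting as both
    queries and keys; a real model would add learned projections.
    """
    n, d = len(feats), len(feats[0])
    attn = [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in feats])
        for q in feats
    ]
    # Column mean: how much attention node j receives from all nodes.
    return [sum(attn[i][j] for i in range(n)) / n for j in range(n)]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def weighted_kd_loss(teacher_logits, student_logits, weights, temperature=1.0):
    """Importance-weighted sum of per-node teacher-to-student KL terms."""
    loss = 0.0
    for t, s, w in zip(teacher_logits, student_logits, weights):
        p = softmax([v / temperature for v in t])
        q = softmax([v / temperature for v in s])
        loss += w * kl_divergence(p, q)
    return loss
```

Because every attention row sums to one, the importance scores form a probability distribution over nodes, so they could equally be used as sampling probabilities instead of loss weights.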

Abstract

The recent success of graph neural networks (GNNs) in modeling complex graph-structured data has fueled interest in deploying them on resource-constrained edge devices. However, their substantial computational and memory demands present ongoing challenges. Knowledge distillation (KD) from GNNs to MLPs offers a lightweight alternative, but MLPs remain limited by fixed activations and the absence of neighborhood aggregation, constraining distilled performance. To tackle these intertwined limitations, we propose SA-DSD, a novel self-attention-guided dynamic sampling distillation framework. To the best of our knowledge, this is the first work to employ an enhanced Kolmogorov-Arnold Network (KAN) as the student model. We improve the Fourier KAN into FR-KAN+ with learnable frequency bases, phase shifts, and optimized algorithms, substantially improving nonlinear fitting capability over MLPs while preserving low computational complexity. To explicitly compensate for the absence of neighborhood aggregation that is inherent to both MLPs and KAN-based students, SA-DSD leverages a self-attention mechanism to dynamically identify influential nodes, construct adaptive sampling probability matrices, and enforce teacher-student prediction consistency. Extensive experiments on six real-world datasets demonstrate that, under inductive and most transductive settings, SA-DSD surpasses three GNN teachers by 3.05% to 3.62% and improves FR-KAN+ by 15.61%. Moreover, it achieves a 16.69× parameter reduction and a 55.75% decrease in average runtime per epoch compared to key benchmarks.
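As a rough illustration of what a Fourier-based KAN layer with learnable frequencies and phase shifts might look like, here is a minimal forward-pass sketch in pure Python. It is an assumption-laden sketch and not the paper's FR-KAN+: the class name `FourierKANLayer`, the one-frequency-and-phase-per-harmonic parameterisation, and the initialisation scheme are all hypothetical, and no training loop is shown. Each output is computed as a sum over inputs and frequencies of `a * cos(w * x + phi) + b * sin(w * x + phi)`, plus a bias.

```python
import math
import random


class FourierKANLayer:
    """Illustrative Fourier-feature KAN layer; not the paper's exact FR-KAN+."""

    def __init__(self, in_dim, out_dim, num_freqs, seed=0):
        rng = random.Random(seed)
        self.in_dim, self.out_dim, self.num_freqs = in_dim, out_dim, num_freqs
        # Frequencies start at the integer harmonics 1..K; a real model
        # would update them by gradient descent like any other parameter.
        self.freqs = [float(k + 1) for k in range(num_freqs)]
        # One phase shift per frequency, initialised at zero.
        self.phases = [0.0] * num_freqs
        scale = 1.0 / math.sqrt(in_dim * num_freqs)

        def init_coeffs():
            # Coefficients indexed as [output j][input i][frequency k].
            return [[[rng.gauss(0.0, scale) for _ in range(num_freqs)]
                     for _ in range(in_dim)] for _ in range(out_dim)]

        self.cos_w, self.sin_w = init_coeffs(), init_coeffs()
        self.bias = [0.0] * out_dim

    def forward(self, x):
        """Map one input vector of length in_dim to a vector of length out_dim."""
        assert len(x) == self.in_dim
        out = []
        for j in range(self.out_dim):
            total = self.bias[j]
            for i, xi in enumerate(x):
                for k in range(self.num_freqs):
                    angle = self.freqs[k] * xi + self.phases[k]
                    total += (self.cos_w[j][i][k] * math.cos(angle)
                              + self.sin_w[j][i][k] * math.sin(angle))
            out.append(total)
        return out

    def num_parameters(self):
        """Count of coefficients, frequencies, phases, and biases."""
        coeffs = 2 * self.out_dim * self.in_dim * self.num_freqs
        return coeffs + 2 * self.num_freqs + self.out_dim
```

In this sketch the activation lives on each input-output edge rather than on the node, which is the structural difference from an MLP layer; the frequencies and phases add only `2 * num_freqs` parameters on top of the Fourier coefficients.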

Paper Structure

This paper contains 21 sections, 13 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: Visualization of (a) computational complexity comparison and (b) inference time comparison.
  • Figure 2: Architecture comparison between deep MLPs and KANs.
  • Figure 3: Overall framework diagram of SA-DSD.
  • Figure 4: SA-DSD vs. KRD in terms of number of parameters and runtime.
  • Figure 5: The loss curves of SA-DSD and FR-KAN+ on six datasets.
  • ...and 3 more figures