TINED: GNNs-to-MLPs by Teacher Injection and Dirichlet Energy Distillation

Ziang Zhou, Zhihao Ding, Jieming Shi, Qing Li, Shiqi Shen

TL;DR

TINED tackles the bottleneck of GNN inference latency by distilling layer-wise knowledge from a GNN teacher into an MLP student. It introduces Teacher Injection, which transplants the teacher's feature transformations (FT) into the student and emulates graph propagation (GP) with FC layers, and Dirichlet Energy Distillation, which preserves the GNN's smoothing patterns through DE ratios. The method is backed by a theoretical bound on approximating GP with an FC layer and is trained with a combined loss that includes $\mathcal{L}_{CE}$, $\lambda\mathcal{L}_{KL}$, and $\beta\mathcal{L}_{DED}$. Empirical evaluations across seven datasets show that TINED and its graph-enabled variant TINED+ outperform baselines and even the teacher in production (prod) settings, while offering orders-of-magnitude faster inference. The work advances practical, high-accuracy GNN deployment by leveraging fine-grained, layer-aware knowledge transfer and smoothing dynamics.
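
As a reading aid, the three loss terms named above can be written as one objective. The plain weighted sum below is an assumption inferred from the $\lambda$ and $\beta$ coefficients; the summary does not state over which nodes each term is evaluated, so that detail is left out here.

$$\mathcal{L} = \mathcal{L}_{CE} + \lambda\,\mathcal{L}_{KL} + \beta\,\mathcal{L}_{DED}$$

Under this reading, $\lambda$ weights the soft-label distillation term and $\beta$ weights the Dirichlet Energy Distillation term against the supervised cross-entropy term.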

Abstract

Graph Neural Networks (GNNs) are pivotal in graph-based learning, particularly excelling in node classification. However, their scalability is hindered by the need for multi-hop data during inference, limiting their application in latency-sensitive scenarios. Recent efforts to distill GNNs into multi-layer perceptrons (MLPs) for faster inference often underutilize the layer-level insights of GNNs. In this paper, we present TINED, a novel approach that distills GNNs to MLPs on a layer-by-layer basis using Teacher Injection and Dirichlet Energy Distillation techniques. We focus on two key operations in GNN layers: feature transformation (FT) and graph propagation (GP). We recognize that FT is computationally equivalent to a fully-connected (FC) layer in MLPs. Thus, we propose directly transferring teacher parameters from an FT in a GNN to an FC layer in the student MLP, enhanced by fine-tuning. In TINED, the FC layers in an MLP replicate the sequence of FTs and GPs in the GNN. We also establish a theoretical bound for GP approximation. Furthermore, we note that FT and GP operations in GNN layers often exhibit opposing smoothing effects: GP is aggressive, while FT is conservative. Using Dirichlet energy, we develop a DE ratio to measure these effects and propose Dirichlet Energy Distillation to convey these characteristics from GNN layers to MLP layers. Extensive experiments show that TINED outperforms GNNs and leading distillation methods across various settings and seven datasets. Source code is available at https://github.com/scottjiao/TINED_ICML25/.
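
To make the DE ratio idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes the Dirichlet energy is $\mathrm{tr}(\mathbf{X}^\top \tilde{\mathbf{L}} \mathbf{X})$ with the symmetric normalized Laplacian of the graph with self-loops, and that the DE ratio of an operation is the energy of its output divided by the energy of its input. The helper names (`normalized_adjacency`, `dirichlet_energy`, `de_ratio`) and the toy graph are hypothetical.

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def dirichlet_energy(X, A_norm):
    """tr(X^T (I - A_norm) X); smaller values mean smoother node features."""
    L = np.eye(A_norm.shape[0]) - A_norm
    return float(np.trace(X.T @ L @ X))

def de_ratio(X_in, X_out, A_norm):
    """Assumed DE ratio: energy after an operation over energy before it."""
    return dirichlet_energy(X_out, A_norm) / dirichlet_energy(X_in, A_norm)

# Toy 4-node path graph with random 3-dimensional node features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_norm = normalized_adjacency(A)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))

# GP stand-in: one propagation step with the normalized adjacency.
gp_ratio = de_ratio(X, A_norm @ X, A_norm)

# FT stand-in: a random linear map applied to the feature dimension.
W = rng.normal(size=(3, 3))
ft_ratio = de_ratio(X, X @ W, A_norm)

print(f"GP DE ratio: {gp_ratio:.3f}, FT DE ratio: {ft_ratio:.3f}")
```

With this choice of Laplacian, a propagation step can never raise the energy, so the GP ratio stays at or below 1. A linear FT has no such constraint and may raise or lower the energy depending on its weights.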

Paper Structure

This paper contains 24 sections, 1 theorem, 19 equations, 9 figures, 13 tables.

Key Result

Theorem 4.1

For a sparse matrix $\mathbf{L}\in \mathbb{R}^{n\times n}$ and a feature matrix $\mathbf{H}\in \mathbb{R}^{n\times d}$ with $\mathrm{rank}(\mathbf{H})=d$, there exists a transformation matrix $\mathbf{W}^*$ to approximate $\mathbf{L}\mathbf{H}$ by $\mathbf{H}\mathbf{W}^*$ … where $\|\cdot\|_F$ is the Frobenius norm and $\lambda_{\max}(\mathbf{L})$ is the largest eigenvalue …
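
The statement of Theorem 4.1 is truncated here, so its bound cannot be checked from this text. The NumPy sketch below only illustrates the setup: given a propagation matrix and a full-column-rank feature matrix, it computes the least-squares transformation that makes a single FC-style product come as close as possible to the propagated features in Frobenius norm. Taking $\mathbf{W}^*$ to be this least-squares solution is an assumption, and the helper names and toy ring graph are hypothetical.

```python
import numpy as np

def best_fc_for_gp(L, H):
    """Least-squares W minimizing ||L H - H W||_F (assumed form of W*)."""
    W, *_ = np.linalg.lstsq(H, L @ H, rcond=None)
    return W

def relative_error(L, H, W):
    """Frobenius-norm error of H W as a stand-in for L H, relative to ||L H||_F."""
    target = L @ H
    return np.linalg.norm(target - H @ W, "fro") / np.linalg.norm(target, "fro")

# Toy setup: 6-node ring graph; every node has degree 3 once self-loops are added,
# so the symmetric normalized propagation matrix is simply (A + I) / 3.
n, d = 6, 2
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
L = (A + np.eye(n)) / 3.0

rng = np.random.default_rng(0)
H = rng.normal(size=(n, d))  # full column rank, matching rank(H) = d in the theorem

W_star = best_fc_for_gp(L, H)
err = relative_error(L, H, W_star)
print(f"relative error of the FC approximation: {err:.3f}")
```

Because $\mathbf{W} = \mathbf{0}$ already gives a relative error of exactly 1, the least-squares solution always lands in $[0, 1]$. How close it gets to 0 depends on how much of $\mathbf{L}\mathbf{H}$ lies in the column space of $\mathbf{H}$.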

Figures (9)

  • Figure 1: The DE ratios of FTs and GPs in the layers of GraphSAGE.
  • Figure 2: (a) TINED with Teacher Injection and Dirichlet Energy Distillation; (b) Inference settings
  • Figure 3: Inference Time and Accuracy
  • Figure 4: Accuracy on Different Teacher GNNs
  • Figure 5: t-SNE of model embeddings at different training stages on Citeseer.
  • ...and 4 more figures

Theorems & Definitions (4)

  • Theorem 4.1
  • Definition 4.2
  • Definition 4.3: DE ratio
  • proof