Graph Knowledge Distillation to Mixture of Experts

Pavel Rumiantsev; Mark Coates

TL;DR

Routing-by-Memory (RbM) is a Mixture-of-Experts (MoE) student model for GNN knowledge distillation whose design enforces expert specialization. Experiments show that it delivers considerably more consistent performance across multiple datasets.

Abstract

In terms of accuracy, Graph Neural Networks (GNNs) are the best architectural choice for the node classification task. Their drawback in real-world deployment is the latency that emerges from the neighbourhood processing operation. One solution to the latency issue is to perform knowledge distillation from a trained GNN to a Multi-Layer Perceptron (MLP), where the MLP processes only the features of the node being classified (and possibly some pre-computed structural information). However, the performance of such MLPs in both transductive and inductive settings remains inconsistent for existing knowledge distillation techniques. We propose to address the performance concerns by using a specially designed student model instead of an MLP. Our model, named Routing-by-Memory (RbM), is a form of Mixture-of-Experts (MoE), with a design that enforces expert specialization. By encouraging each expert to specialize in a certain region of the hidden representation space, we demonstrate experimentally that it is possible to derive considerably more consistent performance across multiple datasets. Code is available at https://github.com/Rufaim/routing-by-memory.
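
To make the distillation step in the abstract concrete, here is a minimal sketch of a soft-label distillation loss for a single node. It assumes the common formulation in which the student is trained to match the teacher's softened class distribution through a KL divergence. The exact loss used in the paper is not given in this overview, so the function names (`softmax`, `kd_loss`) and the `temperature` parameter are illustrative assumptions rather than the authors' implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of class logits."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between softened class distributions.

    Assumption: a generic soft-label distillation loss, not necessarily
    the exact form used in the paper.
    """
    p = softmax(teacher_logits, temperature)  # teacher GNN targets
    q = softmax(student_logits, temperature)  # student predictions
    return sum(
        pi * (math.log(pi) - math.log(qi))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )
```

In this setup the teacher logits come from the trained GNN, which sees the graph, while the student logits come from a model that sees only the node's own features. The loss is zero when the two distributions coincide and grows as the student diverges from the teacher.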

Paper Structure (23 sections, 14 equations, 5 figures, 10 tables)


Introduction
Related work
GNN-to-MLP Knowledge Distillation
Mixture-of-Experts
Background
Methodology
Spatial routing by memory
Knowledge Distillation
MoE initialization
Experiments
Experimental setting
Performance comparison
Comparing with ensemble and vanilla MoE
Ablation study, label propagation, and number of experts
Routing spatial structure analysis
...and 8 more sections

Figures (5)

Figure 1: (a) An overview of the overall training framework. A teacher GNN is trained on the graph and provides targets for the Knowledge Distillation (KD) (\ref{eq:loss_kd}) and Knowledge-Aware Reliable Distillation (KRD) (\ref{eq:loss_krd}) losses. A Mixture-of-Experts student is trained on the node features and positional encoding (see Section \ref{sec:distillation}). (b) We use three additional losses to adjust the internal representations of the model, as well as the embeddings we use for routing (see Section \ref{sec:routing}). We provide schematic representations of these losses to aid intuitive understanding. Commitment loss (\ref{eq:commitment_loss}) pulls representations closer to embeddings (highlighted in blue). Self-similarity loss (\ref{eq:self-similarity_loss}) prevents collapse of representations. Load balance loss (\ref{eq:load_balance_loss}) helps to move borderline representations towards the embeddings of the less populated experts.
Figure 2: A simplified example of cosine routing (\ref{eq:rbm_routing}). Three experts are present in total ($E=3$). Two experts are used at a time ($k=2$), so the two experts with the closest embeddings are selected. Arrows show expert embeddings on the unit circle. Points are representations of the previously routed training examples (see equation \ref{eq:rbm_update}).
Figure 3: Schematic depiction of a student model with two RbM layers. Three experts are present per layer with two experts used for each sample.
Figure 4: Analysis of the hidden representation for RbM (b) and its projection for MoE (a). Each point represents an instance from the Academic-Physics dataset in the transductive setting.
Figure 5: Test set accuracy with respect to the number of experts/clusters for RbM on the OGB-ArXiv dataset. The optimal number of clusters (5) is clearly identifiable in both transductive and inductive cases.
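
The cosine routing described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions: each expert owns one embedding vector, a node's hidden representation is compared to every embedding by cosine similarity, and the $k$ most similar experts are selected. The helper names (`normalize`, `cosine_top_k`) are hypothetical. How the selected experts' outputs are combined and how embeddings are updated from previously routed examples are not covered by this overview and are not shown.

```python
import math


def normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_top_k(representation, expert_embeddings, k=2):
    """Return indices of the k experts whose embeddings are closest
    to the representation in cosine similarity, most similar first.

    Assumption: an illustrative routing rule, not the authors' code.
    """
    h = normalize(representation)
    scored = []
    for index, embedding in enumerate(expert_embeddings):
        e = normalize(embedding)
        similarity = sum(a * b for a, b in zip(h, e))
        scored.append((similarity, index))
    # Sort by descending similarity; ties keep the lower expert index.
    scored.sort(key=lambda pair: -pair[0])
    return [index for _, index in scored[:k]]


# Three experts (E=3) placed 120 degrees apart on the unit circle.
experts = [
    [1.0, 0.0],
    [-0.5, math.sqrt(3) / 2],
    [-0.5, -math.sqrt(3) / 2],
]
print(cosine_top_k([1.0, 0.2], experts, k=2))  # prints [0, 1]
```

With $E=3$ and $k=2$, as in the figure, the representation `[1.0, 0.2]` lies closest to the first expert's embedding and next closest to the second, so those two experts are selected and the third is skipped.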
