Graph Knowledge Distillation to Mixture of Experts

Pavel Rumiantsev, Mark Coates

TL;DR

Routing-by-Memory (RbM) is a Mixture-of-Experts (MoE) student model for distilling knowledge from a trained GNN. Its design enforces expert specialization, and experiments show that it achieves considerably more consistent performance across multiple datasets.

Abstract

In terms of accuracy, Graph Neural Networks (GNNs) are the best architectural choice for the node classification task. Their drawback in real-world deployment is the latency that arises from the neighbourhood processing operation. One solution to the latency issue is to perform knowledge distillation from a trained GNN to a Multi-Layer Perceptron (MLP), where the MLP processes only the features of the node being classified (and possibly some pre-computed structural information). However, with existing knowledge distillation techniques, the performance of such MLPs remains inconsistent in both transductive and inductive settings. We propose to address these performance concerns by using a specially designed student model instead of an MLP. Our model, named Routing-by-Memory (RbM), is a form of Mixture-of-Experts (MoE) with a design that enforces expert specialization. By encouraging each expert to specialize in a certain region of the hidden representation space, we demonstrate experimentally that it is possible to obtain considerably more consistent performance across multiple datasets. Code is available at https://github.com/Rufaim/routing-by-memory.
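The distillation idea in the abstract can be illustrated with a generic soft-label objective. The sketch below is a minimal illustration, not the paper's loss: it assumes the common formulation in which the student matches the teacher's softened class distribution for each node via a KL divergence. The function names and the `temperature` parameter are hypothetical, and the KD and KRD losses used by RbM may differ in detail.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between softened class distributions for one node.

    Assumption: a generic distillation objective, not the paper's exact loss.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits for a single node with three classes.
teacher = [2.0, 0.5, -1.0]   # produced by the trained GNN (uses the graph)
student = [1.0, 1.0, 0.0]    # produced from the node's own features only
loss = kd_loss(teacher, student)
```

At inference time only the student is evaluated, so no neighbourhood aggregation is needed; the teacher is used during training alone.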

Paper Structure (23 sections, 14 equations, 5 figures, 10 tables)


Figures (5)

  • Figure 1: (a) An overview of the training framework. A teacher GNN is trained on the graph and provides targets for the Knowledge Distillation (KD) (\ref{eq:loss_kd}) and Knowledge-Aware Reliable Distillation (KRD) (\ref{eq:loss_krd}) losses. A Mixture-of-Experts student is trained on the node features and positional encoding (see Section \ref{sec:distillation}). (b) We use three additional losses to adjust the internal representations of the model, as well as the embeddings we use for routing (see Section \ref{sec:routing}). We provide schematic representations of these losses to aid intuitive understanding. Commitment loss (\ref{eq:commitment_loss}) pulls representations closer to embeddings (highlighted in blue). Self-similarity loss (\ref{eq:self-similarity_loss}) prevents collapse of representations. Load balance loss (\ref{eq:load_balance_loss}) helps to move borderline representations towards the embeddings of the less populated experts.
  • Figure 2: A simplified example of cosine routing (\ref{eq:rbm_routing}). Three experts are present in total ($E=3$). Two experts are used at a time ($k=2$), so the two experts with the closest embeddings are selected. Arrows show expert embeddings on the unit circle. Points are representations of the previously routed training examples (see equation \ref{eq:rbm_update}).
  • Figure 3: Schematic depiction of a student model with two RbM layers. Three experts are present per layer with two experts used for each sample.
  • Figure 4: Analysis of the hidden representation for RbM (b) and its projection for MoE (a). Each point represents an instance from the Academic-Physics dataset in the transductive setting.
  • Figure 5: Test set accuracy with respect to the number of experts/clusters for RbM on the OGB-ArXiv dataset. The optimal number of clusters (5) is clearly identifiable in both the transductive and inductive cases.
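The cosine routing described in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the expert embeddings and the input representation are made-up 2-D vectors, `route_top_k` is a hypothetical helper, and the softmax weighting of the selected experts is an assumption, since the routing equation itself is not reproduced in this summary.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def route_top_k(h, expert_embeddings, k=2):
    """Select the k experts whose embeddings are closest to h in cosine similarity.

    Returns the chosen expert indices (most similar first) and gate weights.
    Assumption: gate weights are a softmax over the selected similarities.
    """
    sims = [cosine(h, e) for e in expert_embeddings]
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    exps = [math.exp(sims[i]) for i in top]
    total = sum(exps)
    return top, [x / total for x in exps]

# Three hypothetical expert embeddings on the unit circle (E = 3).
experts = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
# A hypothetical hidden representation of one node.
h = (0.9, 0.3)
chosen, weights = route_top_k(h, experts, k=2)
```

Here the representation points mostly along the first embedding, so experts 0 and 1 are selected and expert 2, which points in the opposite direction, is skipped.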