Mixture of Weak & Strong Experts on Graphs

Hanqing Zeng, Hanjia Lyu, Diyi Hu, Yinglong Xia, Jiebo Luo

TL;DR

Mixture of Weak & Strong Experts on Graphs (Mowst) introduces a per-node gating mechanism that combines a lightweight MLP (weak expert) with an off-the-shelf GNN (strong expert), so that node features and neighborhood structure are handled separately. The gate computes a per-node confidence C from the dispersion of the MLP's logits, and C weights the two experts in the loss L_Mowst. The training strategy promotes iterative specialization of the experts and, in some variants, denoised fine-tuning of the GNN. The framework is proven to be at least as expressive as any single expert, with Mowst-GCN strictly more expressive than GCN, and it is designed to keep computation close to that of a single GNN. Empirically, Mowst and its variant Mowst* achieve significant accuracy gains on six benchmarks with four GNN backbones, covering both homophilous and heterophilous graphs, which highlights the practical impact of decoupling features and structure through a simple, scalable MoE design.

Abstract

Realistic graphs contain both (1) rich self-features of nodes and (2) informative structures of neighborhoods, jointly handled by a Graph Neural Network (GNN) in the typical setup. We propose to decouple the two modalities by Mixture of weak and strong experts (Mowst), where the weak expert is a lightweight Multi-layer Perceptron (MLP), and the strong expert is an off-the-shelf GNN. To adapt the experts' collaboration to different target nodes, we propose a "confidence" mechanism based on the dispersion of the weak expert's prediction logits. The strong expert is conditionally activated in the low-confidence region when either the node's classification relies on neighborhood information, or the weak expert has low model quality. We reveal interesting training dynamics by analyzing the influence of the confidence function on loss: our training algorithm encourages the specialization of each expert by effectively generating a soft splitting of the graph. In addition, our "confidence" design imposes a desirable bias toward the strong expert to benefit from the GNN's better generalization capability. Mowst is easy to optimize and achieves strong expressive power, with a computation cost comparable to a single GNN. Empirically, Mowst on 4 backbone GNN architectures shows significant accuracy improvement on 6 standard node classification benchmarks, including both homophilous and heterophilous graphs (https://github.com/facebookresearch/mowst-gnn).
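
As a rough illustration of the confidence mechanism described above, the following sketch computes a per-node confidence from the weak expert's prediction and uses it to weight the two experts' losses. The specific choices are assumptions made for illustration and are not taken from the text: variance as the dispersion measure, a rescaling to [0, 1] as the gate, a convex combination of cross-entropy losses as the form of L_Mowst, and all function names (`confidence`, `mowst_loss`, and so on). The official implementation is in the repository linked in the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(p):
    """Hypothetical gate C = G(D(p)) for a prediction p over n >= 2 classes.

    D is the variance of the class probabilities (0 for a uniform
    prediction, maximal for a one-hot prediction). G rescales D by its
    maximum so that C lies in [0, 1]. Both choices are assumptions.
    """
    n = len(p)
    variance = sum((x - 1.0 / n) ** 2 for x in p) / n
    max_variance = (n - 1) / n ** 2  # variance of a one-hot vector
    return min(1.0, variance / max_variance)


def cross_entropy(p, label):
    """Cross-entropy of prediction p against an integer class label."""
    return -math.log(max(p[label], 1e-12))


def mowst_loss(mlp_logits, gnn_logits, labels):
    """Assumed form of the loss: per-node confidence-weighted mix of losses.

    A confident weak expert (C near 1) is held responsible for the node;
    otherwise the loss falls mostly on the strong expert.
    """
    total = 0.0
    for z_mlp, z_gnn, y in zip(mlp_logits, gnn_logits, labels):
        p_mlp, p_gnn = softmax(z_mlp), softmax(z_gnn)
        c = confidence(p_mlp)
        total += c * cross_entropy(p_mlp, y) + (1.0 - c) * cross_entropy(p_gnn, y)
    return total / len(labels)
```

Under these assumptions, a node on which the MLP outputs a uniform prediction gets C = 0, so its loss term depends only on the GNN. This matches the abstract's statement that the strong expert is activated in the low-confidence region.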

Paper Structure (75 sections, 21 theorems, 45 equations, 4 figures, 5 tables, 2 algorithms)

Key Result

Proposition 2.2

$C=G\circ D$ is quasiconvex.
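
For readers unfamiliar with the term, the short derivation below sketches why a composition of this shape is quasiconvex. It assumes that the dispersion function $D$ is convex and the gate $G$ is monotonically non-decreasing. Those conditions belong to Definition 2.1, which is not reproduced in this summary, so they should be read as assumptions rather than as the paper's exact statement.

```latex
% Assumptions (not stated in this summary): D is convex, G is non-decreasing.
% Quasiconvexity of C means C(\lambda p + (1-\lambda) q) \le \max\{C(p), C(q)\}.
\begin{align*}
D\bigl(\lambda p + (1-\lambda) q\bigr)
  &\le \lambda D(p) + (1-\lambda) D(q)
   \le \max\{D(p),\, D(q)\}, \qquad \lambda \in [0,1], \\
C\bigl(\lambda p + (1-\lambda) q\bigr)
  &= G\bigl(D(\lambda p + (1-\lambda) q)\bigr)
   \le G\bigl(\max\{D(p),\, D(q)\}\bigr)
   = \max\{C(p),\, C(q)\}.
\end{align*}
```

The first line uses convexity of $D$; the second applies the monotone $G$ to both sides, which preserves the inequality and commutes with the maximum.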

Figures (4)

  • Figure 1: Design overview of Mowst. The full system is composed of a weak expert, a strong expert, and a gating module. Diverse collaboration behaviors between the weak & strong experts emerge as a result of the gating module's coordination. The gating function, which can be either manually defined or automatically learned (via an additional compact MLP), calculates a confidence score based on the dispersion of only the weak expert's prediction logits. The confidence score varies across different target nodes depending on the experts' relative strength on the local graph region. The score also directly controls how each expert's own logits are combined into the system's final prediction.
  • Figure 2: Evolution of the $C$ distribution.
  • Figure 3: Test accuracy comparison between "weak-strong" and "strong-strong".
  • Figure 4: t-SNE visualization on GNN embeddings for Flickr.

Theorems & Definitions (37)

  • Definition 2.1
  • Proposition 2.2
  • Proposition 2.3
  • Theorem 2.4
  • Proposition 2.5
  • Proposition 2.6
  • Theorem 2.7
  • Proposition 2.8
  • Definition D.1
  • Definition D.2
  • ...and 27 more