Mixture of Weak & Strong Experts on Graphs
Hanqing Zeng, Hanjia Lyu, Diyi Hu, Yinglong Xia, Jiebo Luo
TL;DR
Mixture of Weak & Strong Experts on Graphs (Mowst) introduces a per-node gating mechanism that combines a lightweight MLP (weak expert) with an off-the-shelf GNN (strong expert), so that node self-features and neighborhood structure are handled separately. The gate is based on the dispersion of the MLP's prediction logits and yields a per-node confidence C, which weights the two experts in the loss L_Mowst. The training strategy promotes iterative specialization of the experts and, in some variants, denoised fine-tuning of the GNN. The framework is proven to be at least as expressive as any single expert, with Mowst-GCN strictly more expressive than GCN, and is designed to keep computation close to that of a single GNN. Empirically, Mowst and its variant Mowst* achieve significant accuracy gains on six benchmarks with four GNN backbones, covering both homophilous and heterophilous graphs, which highlights the practical value of decoupling features and structure through a simple, scalable MoE design.
Abstract
Realistic graphs contain both (1) rich self-features of nodes and (2) informative structures of neighborhoods, jointly handled by a Graph Neural Network (GNN) in the typical setup. We propose to decouple the two modalities by a Mixture of weak and strong experts (Mowst), where the weak expert is a lightweight Multi-layer Perceptron (MLP), and the strong expert is an off-the-shelf GNN. To adapt the experts' collaboration to different target nodes, we propose a "confidence" mechanism based on the dispersion of the weak expert's prediction logits. The strong expert is conditionally activated in the low-confidence region, when either the node's classification relies on neighborhood information or the weak expert has low model quality. We reveal interesting training dynamics by analyzing the influence of the confidence function on the loss: our training algorithm encourages the specialization of each expert by effectively generating a soft splitting of the graph. In addition, our "confidence" design imposes a desirable bias toward the strong expert to benefit from the GNN's better generalization capability. Mowst is easy to optimize and achieves strong expressive power, with a computation cost comparable to that of a single GNN. Empirically, Mowst on 4 backbone GNN architectures shows significant accuracy improvement on 6 standard node classification benchmarks, including both homophilous and heterophilous graphs (https://github.com/facebookresearch/mowst-gnn).
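
The confidence mechanism described above can be illustrated with a minimal sketch. It is not the released implementation: it assumes negative entropy (normalized to [0, 1]) as the dispersion measure and a confidence-weighted sum of the two experts' cross-entropy losses, neither of which is specified in this abstract. The names `confidence` and `mowst_style_loss` are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Row-wise softmax with max-subtraction for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def confidence(probs):
    # ASSUMPTION: confidence = 1 - normalized entropy of the weak expert's
    # prediction. A peaked prediction gives C near 1, a uniform one C near 0.
    n_classes = probs.shape[1]
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return np.clip(1.0 - entropy / np.log(n_classes), 0.0, 1.0)

def cross_entropy(probs, labels):
    # Per-node cross-entropy against integer class labels.
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12)

def mowst_style_loss(mlp_logits, gnn_logits, labels):
    # ASSUMPTION: per-node loss = C * loss(MLP) + (1 - C) * loss(GNN),
    # averaged over nodes. Low-confidence nodes are dominated by the GNN term.
    p_mlp = softmax(mlp_logits)
    p_gnn = softmax(gnn_logits)
    c = confidence(p_mlp)
    per_node = c * cross_entropy(p_mlp, labels) + (1.0 - c) * cross_entropy(p_gnn, labels)
    return per_node.mean(), c

# Two nodes, three classes: the MLP is confident on node 0 and undecided on node 1.
mlp_logits = np.array([[8.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
gnn_logits = np.array([[1.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
labels = np.array([0, 1])
loss, c = mowst_style_loss(mlp_logits, gnn_logits, labels)
```

In this toy input, node 0 receives a confidence close to 1, so its loss comes almost entirely from the MLP, while node 1 receives a confidence close to 0 and is effectively handed to the GNN. This is the kind of soft splitting of the graph that the abstract refers to.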
