Understanding the Mixture-of-Experts with Nadaraya-Watson Kernel
Chuanyang Zheng, Jiankai Sun, Yihang Gao, Enze Xie, Yuehao Wang, Peihao Wang, Ting Xu, Matthew Chang, Liliang Ren, Jingyao Li, Jing Xiong, Kashif Rasul, Mac Schwager, Anderson Schneider, Zhangyang Wang, Yuriy Nevmyvaka
TL;DR
This work challenges the long-standing reliance on Softmax as the router in Mixture-of-Experts (MoE) by reframing MoE routing through Nadaraya-Watson regression. It shows that FFN and MoE can be seen as parametric NW regressions, then introduces KERN, a kernel-inspired, zero-additional-cost router that uses ReLU activation and $\ell_2$-normalization to provide FFN-style gating without constraining outputs to a probability simplex. Empirically, KERN consistently outperforms Softmax and other routers across diverse datasets, training lengths, context sizes, granularity (numbers of experts), sparsity levels, and large-scale pretraining, including up to 6.9B-parameter regimes. The work demonstrates that a principled, FFN-aligned routing design can yield stable training, better expert utilization, and meaningful performance gains, positioning KERN as a strong baseline and potential replacement for Softmax in future MoE architectures.
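For readers who want the correspondence in symbols, the textbook Nadaraya-Watson estimator and a standard Softmax-routed MoE layer can be written side by side. The notation here ($K$, $x_i$, $y_i$, $g_i$, $E_i$, $w_i$, $N$) is an assumption of this note and may differ from the paper's own symbols.

$$
\hat{f}(x) = \sum_{i=1}^{n} \frac{K(x, x_i)}{\sum_{j=1}^{n} K(x, x_j)} \, y_i
\qquad\text{and}\qquad
y(x) = \sum_{i=1}^{N} g_i(x) \, E_i(x), \quad g_i(x) = \frac{\exp(w_i^\top x)}{\sum_{j=1}^{N} \exp(w_j^\top x)}
$$

Read this way, the router scores $g_i(x)$ play the role of normalized kernel weights and the expert outputs $E_i(x)$ play the role of the responses $y_i$; Softmax routing corresponds to one particular choice of kernel and normalization.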
Abstract
Mixture-of-Experts (MoE) has become a cornerstone of recent state-of-the-art large language models (LLMs). Traditionally, MoE relies on $\mathrm{Softmax}$ as the router score function to aggregate expert outputs, a design choice that has persisted from the earliest MoE models to modern LLMs and is now widely regarded as standard practice. However, the necessity of using $\mathrm{Softmax}$ to project router weights onto a probability simplex remains an unchallenged assumption rather than a principled design choice. In this work, we first revisit the classical Nadaraya-Watson regression and observe that MoE shares the same mathematical formulation. Furthermore, we show that both the feed-forward neural network (FFN) and MoE can be interpreted as special cases of Nadaraya-Watson regression, where the kernel function corresponds to the input neurons of the output layer. Motivated by these insights, we propose the zero-additional-cost Kernel Inspired Router with Normalization (KERN), an FFN-style router function, as an alternative to $\mathrm{Softmax}$. We demonstrate that this router generalizes both $\mathrm{Sigmoid}$- and $\mathrm{Softmax}$-based routers. Based on empirical observations and established practices in FFN implementation, we recommend the use of $\mathrm{ReLU}$ activation and $\ell_2$-normalization in the $\mathrm{KERN}$ router function. Comprehensive experiments in MoE and LLM settings validate the effectiveness of the proposed FFN-style router function $\mathrm{KERN}$.
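To make the contrast with $\mathrm{Softmax}$ routing concrete, the following is a minimal pure-Python sketch of a router that applies $\mathrm{ReLU}$ followed by $\ell_2$-normalization. It is an illustration under stated assumptions, not the paper's implementation: the function names (`softmax_gate`, `relu_l2_gate`, `moe_output`), the `eps` constant, normalizing over all experts before top-k selection, and the toy experts are all hypothetical choices made here.

```python
import math


def softmax_gate(logits):
    """Standard Softmax router scores: non-negative and summing to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relu_l2_gate(logits, eps=1e-6):
    """Assumed KERN-style scores: ReLU, then division by the l2 norm.

    Where the normalization is applied (here: over all experts, before
    top-k selection) is an assumption of this sketch.
    """
    relu = [max(z, 0.0) for z in logits]
    norm = math.sqrt(sum(r * r for r in relu))
    return [r / (norm + eps) for r in relu]


def top_k_indices(scores, k):
    """Indices of the k largest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def moe_output(x, experts, router_weights, gate_fn, k):
    """Weighted sum of the top-k expert outputs: y = sum_i g_i(x) * E_i(x)."""
    # One router logit per expert: dot product of a router row with x.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    scores = gate_fn(logits)
    y = None
    for i in top_k_indices(scores, k):
        expert_out = experts[i](x)
        if y is None:
            y = [0.0] * len(expert_out)
        y = [acc + scores[i] * val for acc, val in zip(y, expert_out)]
    return y


if __name__ == "__main__":
    # Hypothetical toy setup: three scalar-output experts, 2-d input.
    x = [1.0, 2.0]
    router_weights = [[1.0, 1.0], [-1.0, 0.0], [0.0, 2.0]]
    experts = [lambda v: [v[0]], lambda v: [v[1]], lambda v: [v[0] + v[1]]]
    print(moe_output(x, experts, router_weights, softmax_gate, k=2))
    print(moe_output(x, experts, router_weights, relu_l2_gate, k=2))
```

In this sketch the ReLU-based scores have unit $\ell_2$ norm rather than unit sum, so they are not confined to the probability simplex, and experts with negative router logits receive a score of exactly zero.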
