Approximation Rates and VC-Dimension Bounds for (P)ReLU MLP Mixture of Experts
Anastasis Kratsios, Haitz Sáez de Ocáriz Borde, Takashi Furuya, Marc T. Law
TL;DR
This work analyzes mixtures of small MLP experts (MoMLP) with (P)ReLU activation in a routing-based MoE framework. It proves that any Lipschitz function on the unit cube can be uniformly approximated to error $\varepsilon$ by a MoMLP while keeping the number of physically loaded parameters in memory at $\mathcal{O}(\varepsilon^{-1})$, leveraging a tree-based routing among locally specialized Experts. It also establishes a finite VC-dimension bound for the MoMLP class, showing potential generalization, and characterizes a trade-off between the number of experts and per-expert complexity via a parameter $r$, with implications for VRAM budgeted deployment. Additionally, the paper discusses the contrast with super-expressive activations (which can yield infinite VC dimension) and outlines practical guidance for complexity control through a distributed MoE design, including a vector-valued universal-approximation framework and a delta-net routing strategy.
Abstract
Mixture-of-Experts (MoEs) can scale up beyond traditional deep learning models by employing a routing strategy in which each input is processed by a single "expert" deep learning model. This strategy allows us to scale up the number of parameters defining the MoE while maintaining sparse activation, i.e., MoEs only load a small number of their total parameters into GPU VRAM for the forward pass depending on the input. In this paper, we provide an approximation and learning-theoretic analysis of mixtures of expert MLPs with (P)ReLU activation functions. We first prove that for every error level $\varepsilon>0$ and every Lipschitz function $f:[0,1]^n\to \mathbb{R}$, one can construct a MoMLP model (a Mixture-of-Experts comprising of (P)ReLU MLPs) which uniformly approximates $f$ to $\varepsilon$ accuracy over $[0,1]^n$, while only requiring networks of $\mathcal{O}(\varepsilon^{-1})$ parameters to be loaded in memory. Additionally, we show that MoMLPs can generalize since the entire MoMLP model has a (finite) VC dimension of $\tilde{O}(L\max\{nL,JW\})$, if there are $L$ experts and each expert has a depth and width of $J$ and $W$, respectively.
