Approximation Rates and VC-Dimension Bounds for (P)ReLU MLP Mixture of Experts

Anastasis Kratsios, Haitz Sáez de Ocáriz Borde, Takashi Furuya, Marc T. Law

TL;DR

This work analyzes mixtures of small MLP experts (MoMLPs) with (P)ReLU activations in a routing-based MoE framework. It proves that any Lipschitz function on the unit cube can be uniformly approximated to error $\varepsilon$ by a MoMLP while keeping the number of parameters loaded in memory at $\mathcal{O}(\varepsilon^{-1})$, using tree-based routing among locally specialized experts. It also establishes a finite VC-dimension bound for the MoMLP class, which implies that MoMLPs can generalize, and characterizes a trade-off between the number of experts and per-expert complexity via a parameter $r$, with implications for deployment under a VRAM budget. Additionally, the paper contrasts this with super-expressive activations (which can yield infinite VC dimension) and outlines practical guidance for complexity control through a distributed MoE design, including a vector-valued universal-approximation framework and a $\delta$-net routing strategy.

Abstract

Mixture-of-Experts (MoEs) can scale up beyond traditional deep learning models by employing a routing strategy in which each input is processed by a single "expert" deep learning model. This strategy allows us to scale up the number of parameters defining the MoE while maintaining sparse activation, i.e., MoEs only load a small number of their total parameters into GPU VRAM for the forward pass, depending on the input. In this paper, we provide an approximation and learning-theoretic analysis of mixtures of expert MLPs with (P)ReLU activation functions. We first prove that for every error level $\varepsilon>0$ and every Lipschitz function $f:[0,1]^n\to \mathbb{R}$, one can construct a MoMLP model (a Mixture-of-Experts comprising (P)ReLU MLPs) which uniformly approximates $f$ to $\varepsilon$ accuracy over $[0,1]^n$, while only requiring networks of $\mathcal{O}(\varepsilon^{-1})$ parameters to be loaded in memory. Additionally, we show that MoMLPs can generalize, since the entire MoMLP model has a (finite) VC dimension of $\tilde{O}(L\max\{nL,JW\})$ if there are $L$ experts and each expert has a depth and width of $J$ and $W$, respectively.
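
As a rough illustration of how the VC-dimension bound scales, consider the hypothetical values $n=2$, $L=8$, $J=4$, $W=16$ (these numbers are chosen for illustration and are not from the paper). Since $\tilde{O}$ hides constants and logarithmic factors, the figure below indicates scaling only, not an actual VC dimension:

$$L\max\{nL,\,JW\} = 8\cdot\max\{2\cdot 8,\ 4\cdot 16\} = 8\cdot\max\{16,\,64\} = 512.$$

With $n$, $J$, and $W$ held fixed, the maximum switches from $JW$ to $nL$ once $L \ge JW/n$ (here $L \ge 32$); past that point the bound grows like $nL^2$ in the number of experts rather than linearly in $L$.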

Paper Structure (39 sections, 8 theorems, 57 equations, 3 figures, 8 tables, 1 algorithm)

Key Result

Theorem 4.1

Suppose that $\sigma$ satisfies Definition 3.1. Fix a "number of experts-to-expert complexity trade-off parameter" $r\in \mathbb{R}$. For every $\alpha$-Hölder map $f: \overline{B}_n(0,1) \to \mathbb{R}^m$ with $0<\alpha\le 1$ and each approximation error $\varepsilon>0$, there is a $p\in$ […] and for each $x\in \mathcal{K}$ and $i=1,\dots,L$, if $\|x-v_i\|<\delta$ then […] The depth and wid[…]
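
For reference, the regularity assumption in the statement above is the standard Hölder condition. In the notation assumed here (the constant $C \ge 0$ is not named in the excerpt), $f$ is $\alpha$-Hölder if

$$\|f(x)-f(y)\| \le C\,\|x-y\|^{\alpha} \quad \text{for all } x,y\in \overline{B}_n(0,1).$$

The case $\alpha=1$ is exactly the Lipschitz condition used in the abstract, so the theorem covers the Lipschitz setting and extends it to rougher functions with $0<\alpha<1$.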

Figures (3)

  • Figure 1: $1$) The distance from each input $x$ to all prototypes $p_1,\dots,p_8$ ($\ell=8$) is queried. $2$) The network ($\hat{f}_2$ in the figure) assigned to the nearest prototype ($p_2$) is loaded onto the GPU and used for prediction.
  • Figure 2: Visual Comparison of Functions with High ($\alpha\approx 1$) vs. Low ($\alpha\approx 0$) Hölder regularity. If $\alpha\approx 1$, the function (green) is approximately differentiable almost everywhere, meaning it does not oscillate much locally and thus is simple to approximate. If $\alpha\approx 0$, the function may be nowhere differentiable and jagged; its extreme details make it difficult to approximate.
  • Figure 3: Comparison of ground truth and predicted results for 2D Ackley and Rastrigin functions over the domain $[-1,1]^2$.
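
The two-step routing described in the Figure 1 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a flat nearest-prototype search rather than the tree-based routing mentioned in the TL;DR, plain ReLU in place of trainable PReLU, and randomly initialized (untrained) experts. All function names, the prototype grid, and the layer sizes are hypothetical.

```python
import numpy as np

def relu_mlp(params, x):
    """Evaluate a small ReLU MLP given as a list of (W, b) layers."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(W @ h + b, 0.0)  # hidden layers use ReLU
    W, b = params[-1]
    return W @ h + b                    # final layer is affine

def route(prototypes, x):
    """Step 1: index of the prototype nearest to x (Euclidean distance)."""
    return int(np.argmin(np.linalg.norm(prototypes - x, axis=1)))

def momlp_forward(prototypes, experts, x):
    """Step 2: evaluate only the expert assigned to the nearest prototype."""
    i = route(prototypes, x)
    return i, relu_mlp(experts[i], x)

def init_expert(rng, n_in, width, n_out):
    """Hypothetical one-hidden-layer expert with random weights."""
    return [
        (rng.standard_normal((width, n_in)), np.zeros(width)),
        (rng.standard_normal((n_out, width)), np.zeros(n_out)),
    ]

rng = np.random.default_rng(0)
# Eight prototypes on a 4 x 2 grid inside the unit square (assumed layout).
prototypes = np.array(
    [[i / 4 + 0.125, j / 2 + 0.25] for i in range(4) for j in range(2)]
)
experts = [init_expert(rng, n_in=2, width=4, n_out=1) for _ in prototypes]

x = np.array([0.2, 0.7])
idx, y = momlp_forward(prototypes, experts, x)
print(idx, y)  # idx is 1: the second prototype, matching p_2 in the caption
```

Only the parameters of `experts[idx]` are touched during the forward pass, which is the sense in which a single expert needs to be resident in memory for a given input.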

Theorems & Definitions (21)

  • Definition 3.1: Trainable PReLU
  • Definition 3.2: MoMLPs
  • Remark 3.3: Partitioning in \ref{eq:paritioning} in classical computer science
  • Theorem 4.1: Trade-Off: No. Expert vs. Expert Complexity
  • Theorem 4.2: VC-Dimension Bounds for MoMLPs - MoMLPs Can Generalize
  • Definition 4.3: Trainable Super-Expressive Activation Function
  • Proposition 4.4: MLPs with Super-Expressive Activation Do Not Generalize
  • Example 1: Subsets of Euclidean Spaces
  • Lemma 5.1: Size of a Tree Whose Nodes Form a $\delta$-net of a Compact Subset of $\mathbb{R}^n$
  • Lemma 5.2: Vector-Valued Universal Approximation Theorem with Explicit Diameter Dependence
  • ...and 11 more