Mixture-of-Experts under Finite-Rate Gating: Communication--Generalization Trade-offs

Ali Khalesi, Mohammad Reza Deylam Salehi

TL;DR

A communication-theoretic view of MoE gating is adopted, modeling the gate as a stochastic channel operating under a finite information rate, which yields capacity-aware limits for communication-constrained MoE systems.

Abstract

Mixture-of-Experts (MoE) architectures decompose prediction tasks into specialized expert sub-networks selected by a gating mechanism. This letter adopts a communication-theoretic view of MoE gating, modeling the gate as a stochastic channel operating under a finite information rate. Within an information-theoretic learning framework, we specialize a mutual-information generalization bound and develop a rate-distortion characterization $D(R_g)$ of finite-rate gating, where $R_g := I(X; T)$, yielding (under a standard empirical rate-distortion optimality condition) $\mathbb{E}[R(W)] \le D(R_g) + \delta_m + \sqrt{(2/m)\, I(S; W)}$. The analysis yields capacity-aware limits for communication-constrained MoE systems, and numerical simulations on synthetic multi-expert models empirically confirm the predicted trade-offs between gating rate, expressivity, and generalization.
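
To make "finite information rate" concrete, the following worked inequality is a standard information-theoretic fact rather than a result quoted from the paper. It assumes, as in the Figure 1 caption, that the routing index $T$ takes values in $[n]$, i.e. there are $n$ experts. Under that assumption the gating rate can never exceed $\log n$:

```latex
% Assumption: T is a discrete routing index over n experts, T \in [n].
\begin{align*}
R_g &:= I(X;T) \\
    &= H(T) - H(T \mid X) \\
    &\le H(T)            && \text{since } H(T \mid X) \ge 0 \\
    &\le \log n          && \text{since } T \text{ has at most } n \text{ outcomes.}
\end{align*}
% Example: n = 8 experts gives R_g \le \log_2 8 = 3 bits per input.
```

Equality in the first step requires a deterministic gate, and equality in the second requires the experts to be selected uniformly often.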

Paper Structure (10 sections, 2 theorems, 32 equations, 3 figures)

This paper contains 10 sections, 2 theorems, 32 equations, 3 figures.

Key Result

Theorem 1

Assume $\ell \in [0,1]$. Let $W$ be the (possibly randomized) parameters produced by a learning algorithm trained on sample $S=\{(x_j,y_j)\}_{j=1}^m\sim \mathcal{D}^m$. Then, for any sample size $m \ge 1$, the expected generalization gap satisfies $$\bigl|\mathbb{E}[R(W) - R_S(W)]\bigr| \le \sqrt{\frac{2}{m}\, I(S; W)},$$ where $I(S; W)$ in (eq:itmoe) is the mutual information between the training sample and the learned parameters. Unless stated otherwise, all mutual infor…
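
As a rough numerical illustration, the sketch below evaluates the information term $\sqrt{(2/m)\, I(S;W)}$ that appears in the abstract and in the Figure 2 caption. The function name, the choice of nats as the unit, and the value $I(S;W)=5$ are assumptions made for illustration only; they are not taken from the paper.

```python
import math


def information_gap_bound(mi_nats: float, m: int) -> float:
    """Evaluate sqrt((2/m) * I(S;W)).

    Assumption: I(S;W) is supplied in nats; the unit convention is not
    recoverable from the source text.
    """
    if m < 1:
        raise ValueError("sample size m must be at least 1")
    if mi_nats < 0:
        raise ValueError("mutual information cannot be negative")
    return math.sqrt(2.0 * mi_nats / m)


# Hypothetical value of I(S;W), chosen only to show the 1/sqrt(m) decay.
for m in (100, 1_000, 10_000):
    print(m, round(information_gap_bound(mi_nats=5.0, m=m), 4))
```

Because the loss lies in $[0,1]$, the gap itself can never exceed 1, so the bound is informative only when $I(S;W) < m/2$.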

Figures (3)

  • Figure 1: MoE system viewed as a finite-rate stochastic communication link. The input feature vector $X$ is processed by a gating module implementing a channel $P(T \mid X; W_{\text{gate}})$ that maps $X$ to a routing index $T\in[n]$ under an information-rate constraint. The index selects an expert from the bank $\{h_g(\cdot;W_g)\}_{g=1}^n$, yielding the prediction $\hat{Y}=h_T(X;W_T)$. Here $I(X;T)$ is the effective communication rate to the expert bank, so the gate acts as a constrained link that limits how much information about $X$ reaches the experts, shaping expressivity and generalization.
  • Figure 2: MoE simulation for Theorem 1. The empirical generalization gap $\lvert\mathbb{E}[R-R_S]\rvert$ increases with the information term $\sqrt{2 I(S; W)/m}$ and remains below the theoretical upper bound, illustrating the information-generalization trade-off at the algorithm level.
  • Figure 3: BSC rate-distortion-generalization experiment for Theorem 2. The empirical mean population risk $\mathbb{E}[R(W)]$ (dots) closely follows the rate-distortion curve $D(R_g)$, while the bound in (eq-eval-bound) (triangles) remains above it, consistent with the theoretical trade-off between gating rate and prediction accuracy.
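
The gating rate $I(X;T)$ described in the Figure 1 caption can be computed directly when $X$ is discrete and the gate is given as a table of routing probabilities. The sketch below is a minimal illustration under those assumptions, not the paper's experimental code: the function names are hypothetical, and the two-expert gate is modeled as a binary symmetric channel (BSC) with an assumed crossover probability of 0.1 and a uniform binary input. For that case the result can be checked against the textbook closed form $1 - h_b(\varepsilon)$ bits.

```python
import math


def gate_rate_bits(p_x, gate):
    """I(X;T) in bits for input distribution p_x and gate[x][t] = P(T=t | X=x)."""
    n_experts = len(gate[0])
    # Marginal routing distribution: P(T=t) = sum_x p(x) * P(t | x).
    p_t = [sum(p_x[x] * gate[x][t] for x in range(len(p_x)))
           for t in range(n_experts)]
    rate = 0.0
    for x, px in enumerate(p_x):
        for t in range(n_experts):
            joint = px * gate[x][t]
            if joint > 0.0:  # zero-probability pairs contribute nothing
                rate += joint * math.log2(gate[x][t] / p_t[t])
    return rate


def binary_entropy(p):
    """Binary entropy h_b(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


eps = 0.1  # assumed crossover probability, for illustration only
bsc_gate = [[1.0 - eps, eps], [eps, 1.0 - eps]]
rate = gate_rate_bits([0.5, 0.5], bsc_gate)
print(rate, 1.0 - binary_entropy(eps))  # the two values should agree
```

A noiseless gate (`eps = 0`) reaches the full 1 bit available with two experts, while a gate that ignores its input has rate 0, which matches the reading of the gate as a constrained link to the expert bank.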

Theorems & Definitions (8)

  • Theorem 1: Xu--Raginsky bound specialized to MoE
  • Remark 1: Communication interpretation
  • Remark 2: On estimating $I(X;T)$ and $I(S;W)$ in practice
  • Theorem 2: Rate-Distortion-Generalization Bound
  • Remark 3: Generalization trade-off in MoE under local privacy
  • Remark 4: Capacity-limited MoE gating
  • proof
  • proof