Mixture of Experts Softens the Curse of Dimensionality in Operator Learning

Anastasis Kratsios, Takashi Furuya, Jose Antonio Lara Benitez, Matti Lassas, Maarten de Hoop

TL;DR

The paper tackles the challenge of scaling operator learning to infinite-dimensional spaces by introducing a distributed mixture of neural operators (MoNO) arranged as a rooted routing tree. A central result is a distributed universal approximation theorem: any Lipschitz operator between Sobolev spaces of functions on $[0,1]^d$ can be uniformly approximated on a Sobolev unit ball by a MoNO, with each active expert requiring depth, width, and rank that scale as $O(\varepsilon^{-1})$, while the total model grows with the number of leaves. This framework distributes the curse of dimensionality across many small, local experts, enabling feasible memory usage and on-demand loading of parameters, which supports scalable inference and training. Additional contributions include quantitative rates for classical neural operators and a detailed constructive proof that combines finite-dimensional encodings, rank-truncated integral-like mappings, and an efficient tree-based routing strategy. The results have practical implications for high-dimensional operator learning, offering a principled route to deploy large MoNOs on hardware with memory constraints and to apply operator learning to inverse problems and other infinite-dimensional settings.

Abstract

We study the approximation-theoretic implications of mixture-of-experts architectures for operator learning, where the complexity of a single large neural operator is distributed across many small neural operators (NOs), and each input is routed to exactly one NO via a decision tree. We analyze how this tree-based routing and expert decomposition affect approximation power, sample complexity, and stability. Our main result is a distributed universal approximation theorem for mixtures of neural operators (MoNOs): any Lipschitz nonlinear operator between $L^2([0,1]^d)$ spaces can be uniformly approximated over the Sobolev unit ball to arbitrary accuracy $\varepsilon>0$ by a MoNO, where each expert NO has depth, width, and rank scaling as $\mathcal{O}(\varepsilon^{-1})$. Although the number of experts may grow with accuracy, each NO remains small enough to fit within the active memory of standard hardware for reasonable accuracy levels. Our analysis also yields new quantitative approximation rates for classical NOs approximating uniformly continuous nonlinear operators uniformly on compact subsets of $L^2([0,1]^d)$.
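
The routing idea described in the abstract, where each input reaches exactly one expert through a tree, can be illustrated with a toy sketch. This is not the paper's construction: the nearest-center routing rule, the names `Node`, `route`, `LazyMixture`, and `make_scaler`, and the scalar-multiplication "experts" standing in for neural operators are all assumptions made for illustration. Inputs are finite lists of samples rather than functions in $L^2([0,1]^d)$.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Sequence

Vector = Sequence[float]
Expert = Callable[[Vector], List[float]]


def sq_dist(u: Vector, v: Vector) -> float:
    """Squared Euclidean distance, a stand-in for a discretized L^2 distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


@dataclass
class Node:
    """Tree node: internal nodes have children, leaves carry an expert id."""
    center: Vector
    children: List["Node"] = field(default_factory=list)
    expert_id: Optional[int] = None


def route(root: Node, x: Vector) -> int:
    """Descend from the root, always moving to the child with the nearest center.

    Nearest-center routing is an illustrative assumption, not the paper's rule.
    """
    node = root
    while node.children:
        node = min(node.children, key=lambda child: sq_dist(child.center, x))
    if node.expert_id is None:
        raise ValueError("reached a leaf that has no expert attached")
    return node.expert_id


class LazyMixture:
    """Evaluates exactly one expert per input and builds experts only on demand."""

    def __init__(self, root: Node, factories: Dict[int, Callable[[], Expert]]):
        self.root = root
        self.factories = factories          # how to build each expert
        self.loaded: Dict[int, Expert] = {}  # experts currently held in memory

    def __call__(self, x: Vector) -> List[float]:
        k = route(self.root, x)
        if k not in self.loaded:             # on-demand loading of one expert
            self.loaded[k] = self.factories[k]()
        return self.loaded[k](x)


def make_scaler(c: float) -> Expert:
    """Hypothetical 'expert': multiplies every sample by a constant."""
    return lambda x: [c * xi for xi in x]


# Depth-one tree with two leaves; a deeper tree works the same way.
root = Node(center=[0.5, 0.5], children=[
    Node(center=[0.0, 0.0], expert_id=0),
    Node(center=[1.0, 1.0], expert_id=1),
])
model = LazyMixture(root, {0: lambda: make_scaler(2.0),
                           1: lambda: make_scaler(-1.0)})

y = model([0.1, 0.0])  # routed to expert 0; expert 1 is never built
```

Only the expert selected by the router is ever constructed, which mirrors the point that the per-input memory footprint depends on the size of one expert rather than on the total number of leaves.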

Paper Structure

This paper contains 40 sections, 6 theorems, 99 equations, 1 figure, 2 tables, 1 algorithm.

Key Result

Proposition 1

Let $D_i = [0,1]^{d_i}$ ($i=1,2$), and let $K$ be a compact set as in eq:compact_set_Sobolev_type__noncentered. Let $G^{+} : ( H^{s_1}(D_1)^{d_{in}} ,\|\cdot\|_{L^2(D_1)^{d_{in}}}) \to ( H^{s_2}(D_2)^{d_{out}} ,\|\cdot\|_{L^2(D_2)^{d_{out}}})$ be uniformly continuous with a concave modulus of continuity … The rank $N$, depth $L$, and width $W$ as functions of $\varepsilon$ and $\operatorname{diam}(K)$ a…

Figures (1)

  • Figure 1: Complexity conservation in mixtures of neural operators (MoNO). Horizontal axis: $t = -\log_{10}\varepsilon$ denotes target precision (smaller $\varepsilon$ means higher accuracy). We define $z \stackrel{\text{def.}}{=} \max\{\varepsilon^{-1}, \omega(\varepsilon^{-1})\}$, where $\omega$ is the modulus of continuity. Panels: (a) Per-expert depth: classical neural operators (red) require exponential depth $L \sim \exp(z)$, while MoNO (blue dashed) maintains $L \sim z$. (b) Number of experts: classical $N=1$ versus MoNO’s polylogarithmic scaling $N \sim (\log z)^{d_1/2}$. (c) Total memory: the product $(a) \times (b) \times W$ yields $\exp(z)$ for classical models (intractable) versus $z(\log z)^{d_1/2}$ for MoNO (nearly linear). The routing cost $\mathcal{O}(\omega^{-1}(\varepsilon/[\varepsilon^{-2d_1/s_1} \vee (\omega^{-1}(\varepsilon^{-1}))^{2d_2/s_2}]))$ is negligible at this scale.
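
The two total-memory scalings quoted in the caption of Figure 1 can be compared numerically with a short sketch. It only evaluates the stated asymptotic expressions $\exp(z)$ and $z(\log z)^{d_1/2}$ with all constants dropped; the function names, the default choice $\omega(t) = t$, and the value $d_1 = 2$ are assumptions for illustration and are not taken from the paper. Values are reported on a $\log_{10}$ scale so that $\exp(z)$ does not overflow.

```python
import math
from typing import Callable


def effective_scale(eps: float,
                    omega: Callable[[float], float] = lambda t: t) -> float:
    """z = max(1/eps, omega(1/eps)) as in the caption.

    The default omega(t) = t is an assumption made for this sketch.
    """
    return max(1.0 / eps, omega(1.0 / eps))


def log10_classical_total(z: float) -> float:
    """log10 of exp(z), computed without forming exp(z)."""
    return z / math.log(10.0)


def log10_mono_total(z: float, d1: int) -> float:
    """log10 of z * (log z)^(d1/2); needs z > 1 so that log z > 0."""
    return math.log10(z) + 0.5 * d1 * math.log10(math.log(z))


# t = -log10(eps) is the horizontal axis of the figure.
for t in range(1, 5):
    z = effective_scale(10.0 ** (-t))
    print(t,
          round(log10_classical_total(z), 1),
          round(log10_mono_total(z, d1=2), 2))
```

Each printed row gives $t$ followed by the base-10 logarithm of the classical and MoNO expressions, which makes the gap between exponential and nearly linear growth in $z$ visible without any plotting.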

Theorems & Definitions (19)

  • Definition 1: Multilayer perceptron
  • Definition 2: Neural operator
  • Remark 1: Nonlocal operators
  • Remark 2: Integral operators in hidden layers
  • Proposition 1: Expression rates for NOs
  • proof
  • Definition 3: Mixture of neural operators
  • Remark 3: Neural operators are trivial MoNOs
  • Theorem 1: Universal approximation for MoNOs
  • proof
  • ...and 9 more