Mixture of Experts Softens the Curse of Dimensionality in Operator Learning

Anastasis Kratsios; Takashi Furuya; Jose Antonio Lara Benitez; Matti Lassas; Maarten de Hoop

TL;DR

The paper addresses the difficulty of scaling operator learning in infinite-dimensional spaces by introducing a distributed mixture of neural operators (MoNO) whose experts sit at the leaves of a rooted routing tree. The central result is a distributed universal approximation theorem: any Lipschitz operator between Sobolev spaces on $[0,1]^d$ can be uniformly approximated on a Sobolev unit ball by a MoNO in which each active expert needs depth, width, and rank scaling only as $\mathcal{O}(\varepsilon^{-1})$, while the total model size grows with the number of leaves. The curse of dimensionality is thereby distributed across many small, local experts: only the expert selected for a given input has to be held in memory, so parameters can be loaded on demand during inference and training. Additional contributions include quantitative approximation rates for classical neural operators and a constructive proof that combines finite-dimensional encodings, rank-truncated integral-like mappings, and a tree-based routing strategy. In practice, the results suggest a principled route to deploying large MoNOs on memory-constrained hardware and to applying operator learning to inverse problems and other infinite-dimensional settings.

Abstract

We study the approximation-theoretic implications of mixture-of-experts architectures for operator learning, where the complexity of a single large neural operator is distributed across many small neural operators (NOs), and each input is routed to exactly one NO via a decision tree. We analyze how this tree-based routing and expert decomposition affect approximation power, sample complexity, and stability. Our main result is a distributed universal approximation theorem for mixtures of neural operators (MoNOs): any Lipschitz nonlinear operator between $L^2([0,1]^d)$ spaces can be uniformly approximated over the Sobolev unit ball to arbitrary accuracy $\varepsilon>0$ by a MoNO, where each expert NO has depth, width, and rank scaling as $\mathcal{O}(\varepsilon^{-1})$. Although the number of experts may grow with accuracy, each NO remains small enough to fit within the active memory of standard hardware for reasonable accuracy levels. Our analysis also yields new quantitative approximation rates for classical NOs approximating uniformly continuous nonlinear operators uniformly on compact subsets of $L^2([0,1]^d)$.
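The routing idea in the abstract (one rooted tree, exactly one active expert per input, experts loaded only when needed) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's construction: the names `Node`, `MixtureOfOperators`, and `load_expert` are hypothetical, inputs are represented by finite coefficient vectors, the nearest-center descent rule is an assumed stand-in for the paper's routing, and the experts are toy callables rather than neural operators.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Sequence

Vector = Sequence[float]
Expert = Callable[[Vector], List[float]]


def sq_dist(u: Vector, v: Vector) -> float:
    """Squared Euclidean distance between two coefficient vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


@dataclass
class Node:
    """Node of a rooted routing tree; only leaves carry an expert id."""
    center: Vector
    children: List["Node"] = field(default_factory=list)
    expert_id: Optional[int] = None


class MixtureOfOperators:
    """Routes each input to exactly one expert and loads it on demand."""

    def __init__(self, root: Node, load_expert: Callable[[int], Expert]):
        self.root = root
        self.load_expert = load_expert    # hypothetical loader, e.g. from disk
        self.cache: Dict[int, Expert] = {}  # experts currently held in memory

    def route(self, x: Vector) -> int:
        # Assumed rule: descend to the child whose center is closest to x.
        node = self.root
        while node.children:
            node = min(node.children, key=lambda c: sq_dist(x, c.center))
        return node.expert_id

    def __call__(self, x: Vector) -> List[float]:
        k = self.route(x)
        if k not in self.cache:           # load only the active expert
            self.cache[k] = self.load_expert(k)
        return self.cache[k](x)


# Toy usage: two leaves, and expert k multiplies its input by (k + 1).
loaded: List[int] = []


def load_expert(k: int) -> Expert:
    loaded.append(k)                      # record which experts were loaded
    return lambda x: [(k + 1) * xi for xi in x]


root = Node(center=[0.5, 0.5], children=[
    Node(center=[0.0, 0.0], expert_id=0),
    Node(center=[1.0, 1.0], expert_id=1),
])
mono = MixtureOfOperators(root, load_expert)
y = mono([0.9, 0.8])                      # routed to expert 1 only
```

After the single call above, only expert 1 has been loaded; expert 0 stays out of memory until an input near its center arrives. In this sketch, that is the sense in which active complexity (one expert) is separated from total complexity (all leaves).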

Paper Structure (40 sections, 6 theorems, 99 equations, 1 figure, 2 tables, 1 algorithm)

Introduction
Contributions
Secondary Contribution
Organization
Background and notation
Sobolev spaces and the $(H^s(D)^{d_{in}},\|\cdot\|_{L^2(D)^{d_{in}}})$ space
Rooted Tree
Deep learning models and basic approximation rates
Neural networks defined in finite and infinite dimensions
Quantitative universal approximation of neural operators
Mixture of neural operators
Realization of a mixture of neural operators $(\mathcal{T}, \mathcal{NO})$
Active, total, and routing complexity
Main Result
On assumptions about the domain, and architectures
...and 25 more sections

Key Result

Proposition 1

Let $D_i = [0,1]^{d_i}$ ($i=1,2$), and let $K$ be a compact set as in eq:compact_set_Sobolev_type__noncentered. Let $G^{+} : ( H^{s_1}(D_1)^{d_{in}} ,\|\cdot\|_{L^2(D_1)^{d_{in}}}) \to ( H^{s_2}(D_2)^{d_{out}} ,\|\cdot\|_{L^2(D_2)^{d_{out}}})$ be uniformly continuous with a concave modulus of continuity … The rank $N$, depth $L$, and width $W$ as functions of $\varepsilon$ and $\operatorname{diam}(K)$ a…

Figures (1)

Figure 1: Complexity conservation in mixtures of neural operators (MoNO). Horizontal axis: $t = -\log_{10}\varepsilon$ denotes target precision (smaller $\varepsilon$ means higher accuracy). We define $z \stackrel{\text{def.}}{=} \max\{\varepsilon^{-1}, \omega(\varepsilon^{-1})\}$, where $\omega$ is the modulus of continuity. Panels: (a) Per-expert depth: classical neural operators (red) require exponential depth $L \sim \exp(z)$, while MoNO (blue dashed) maintains $L \sim z$. (b) Number of experts: classical $N=1$ versus MoNO’s polylogarithmic scaling $N \sim (\log z)^{d_1/2}$. (c) Total memory: the product $(a) \times (b) \times W$ yields $\exp(z)$ for classical models (intractable) versus $z(\log z)^{d_1/2}$ for MoNO (nearly linear). The routing cost $\mathcal{O}(\omega^{-1}(\varepsilon/[\varepsilon^{-2d_1/s_1} \vee (\omega^{-1}(\varepsilon^{-1}))^{2d_2/s_2}]))$ is negligible at this scale.
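To get a feel for the gap shown in the figure, the scalings quoted in the caption can be evaluated numerically. The sketch below is illustrative only: it drops all constants, assumes a linear modulus $\omega(t) = t$ (so $z = \varepsilon^{-1}$), assumes $d_1 = 2$, and reads $\log$ as the natural logarithm. None of these choices are specified in the caption, and the function name `scalings` is hypothetical. The classical total is reported as $\log_{10}\exp(z)$ because $\exp(z)$ itself overflows floating point for moderate $z$.

```python
import math


def scalings(eps, d1=2, omega=lambda t: t):
    """Evaluate the caption's scalings with all constants dropped.

    Assumptions (not from the source): omega defaults to the identity,
    d1 defaults to 2, and log is the natural logarithm.
    """
    z = max(1.0 / eps, omega(1.0 / eps))
    return {
        "z": z,
        "mono_depth": z,                                # L ~ z per expert
        "mono_experts": math.log(z) ** (d1 / 2),        # N ~ (log z)^(d1/2)
        "mono_total": z * math.log(z) ** (d1 / 2),      # ~ z (log z)^(d1/2)
        "log10_classical_total": z / math.log(10),      # log10 of exp(z)
    }


for t in (1, 2, 3):                                     # t = -log10(eps)
    s = scalings(10.0 ** (-t))
    print(
        f"t={t}: MoNO total ~ {s['mono_total']:.1f}, "
        f"classical total ~ 10^{s['log10_classical_total']:.1f}"
    )
```

Under these assumptions, at $\varepsilon = 10^{-3}$ the MoNO total stays in the thousands, while the classical total is a number with more than four hundred digits. This is the qualitative contrast panel (c) describes.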

Theorems & Definitions (19)

Definition 1: Multilayer perceptron
Definition 2: Neural operator
Remark 1: Nonlocal operators
Remark 2: Integral operators in hidden layers
Proposition 1: Expression rates for NOs
proof
Definition 3: Mixture of neural operators
Remark 3: Neural operators are trivial MoNOs
Theorem 1: Universal approximation for MoNOs
proof
...and 9 more
