Mixture of Experts Softens the Curse of Dimensionality in Operator Learning
Anastasis Kratsios, Takashi Furuya, Jose Antonio Lara Benitez, Matti Lassas, Maarten de Hoop
TL;DR
The paper tackles the challenge of scaling operator learning in infinite-dimensional function spaces by introducing a distributed mixture of neural operators (MoNO) arranged as a rooted routing tree. The central result is a distributed universal approximation theorem: any Lipschitz operator between $L^2([0,1]^d)$ spaces can be uniformly approximated on a Sobolev unit ball to accuracy $\varepsilon>0$ by an MoNO in which each active expert has depth, width, and rank scaling as $\mathcal{O}(\varepsilon^{-1})$, while the total model size grows with the number of leaves. The framework spreads the curse of dimensionality across many small, local experts: because each input is routed to a single expert, only that expert's parameters need to be held in memory and they can be loaded on demand, which supports scalable inference and training. Additional contributions include quantitative approximation rates for classical neural operators and a detailed constructive proof that combines finite-dimensional encodings, rank-truncated integral-like mappings, and an efficient tree-based routing strategy. The results have practical implications for high-dimensional operator learning, offering a principled route to deploying large MoNOs on memory-constrained hardware and to applying operator learning to inverse problems and other infinite-dimensional settings.
Abstract
We study the approximation-theoretic implications of mixture-of-experts architectures for operator learning, where the complexity of a single large neural operator is distributed across many small neural operators (NOs), and each input is routed to exactly one NO via a decision tree. We analyze how this tree-based routing and expert decomposition affect approximation power, sample complexity, and stability. Our main result is a distributed universal approximation theorem for mixtures of neural operators (MoNOs): any Lipschitz nonlinear operator between $L^2([0,1]^d)$ spaces can be uniformly approximated over the Sobolev unit ball to arbitrary accuracy $\varepsilon>0$ by an MoNO, where each expert NO has depth, width, and rank scaling as $\mathcal{O}(\varepsilon^{-1})$. Although the number of experts may grow as the accuracy improves, each NO remains small enough to fit within the active memory of standard hardware for reasonable accuracy levels. Our analysis also yields new quantitative approximation rates for classical NOs approximating uniformly continuous nonlinear operators uniformly on compact subsets of $L^2([0,1]^d)$.
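The routing idea described above (one decision tree, exactly one expert per input, experts loaded only when needed) can be illustrated with a minimal Python sketch. It is not the paper's construction: the names `Node`, `encode`, `route`, `LazyExperts`, and `mono_forward` are hypothetical, the encoding by leading Fourier coefficients and the hyperplane splits are assumptions chosen for simplicity, and the toy experts are plain scalings rather than neural operators.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

import numpy as np


@dataclass
class Node:
    """Node of a binary routing tree (hypothetical structure)."""
    # Internal node: go left if <w, z> <= b, otherwise go right.
    w: Optional[np.ndarray] = None
    b: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    # Leaf node: index of the single expert assigned to this region.
    expert_id: Optional[int] = None


def encode(u: np.ndarray, n_modes: int = 4) -> np.ndarray:
    """Assumed finite-dimensional encoding: real parts of the leading
    Fourier coefficients of the sampled input function u."""
    return np.fft.rfft(u)[:n_modes].real / len(u)


def route(root: Node, z: np.ndarray) -> int:
    """Walk from the root to a leaf and return that leaf's expert index."""
    node = root
    while node.expert_id is None:
        node = node.left if float(node.w @ z) <= node.b else node.right
    return node.expert_id


class LazyExperts:
    """Loads an expert the first time it is requested, then caches it."""

    def __init__(self, loader: Callable[[int], Callable[[np.ndarray], np.ndarray]]):
        self.loader = loader
        self.cache: Dict[int, Callable[[np.ndarray], np.ndarray]] = {}

    def get(self, expert_id: int) -> Callable[[np.ndarray], np.ndarray]:
        if expert_id not in self.cache:
            self.cache[expert_id] = self.loader(expert_id)
        return self.cache[expert_id]


def mono_forward(root: Node, experts: LazyExperts, u: np.ndarray) -> np.ndarray:
    """Route u to exactly one expert and apply only that expert."""
    return experts.get(route(root, encode(u)))(u)


# Toy setup: split on the mean of u (the zeroth Fourier mode).
e0 = np.array([1.0, 0.0, 0.0, 0.0])
root = Node(w=e0, b=0.0, left=Node(expert_id=0), right=Node(expert_id=1))

# Toy loader: expert k multiplies its input by (k + 1).
def toy_loader(k: int) -> Callable[[np.ndarray], np.ndarray]:
    return lambda u: (k + 1) * u

experts = LazyExperts(toy_loader)
```

Calling `mono_forward(root, experts, u)` touches only the expert at the leaf reached by `u`, so the number of parameters resident in memory is governed by the size of one expert rather than by the number of leaves. In a real system the loader would read weights from disk or remote storage instead of building a closure.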
