NESTOR: A Nested MOE-based Neural Operator for Large-Scale PDE Pre-Training

Dengdi Sun, Xiaoya Zhou, Xiao Wang, Hao Si, Wanli Lyu, Jin Tang, Bin Luo

TL;DR

This paper proposes a large-scale PDE pre-trained neural operator based on a nested Mixture-of-Experts (MoE) framework. The model selectively activates the most suitable expert networks for a given input, thereby enhancing generalization and transferability.

Abstract

Neural operators have emerged as an efficient paradigm for solving PDEs, overcoming the limitations of traditional numerical methods and significantly improving computational efficiency. However, due to the diversity and complexity of PDE systems, existing neural operators typically rely on a single network architecture, which limits their capacity to fully capture heterogeneous features and complex system dependencies. This constraint poses a bottleneck for large-scale PDE pre-training based on neural operators. To address these challenges, we propose a large-scale PDE pre-trained neural operator based on a nested Mixture-of-Experts (MoE) framework. In particular, the image-level MoE is designed to capture global dependencies, while the token-level Sub-MoE focuses on local dependencies. Our model can selectively activate the most suitable expert networks for a given input, thereby enhancing generalization and transferability. We conduct large-scale pre-training on twelve PDE datasets from diverse sources and successfully transfer the model to downstream tasks. Extensive experiments demonstrate the effectiveness of our approach.
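
To make the nesting described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes softmax gating with top-k selection, mean-pooling of tokens for the image-level decision, and plain linear maps as experts; none of these choices is stated in the abstract, and the names `NestedMoE`, `SubMoE`, and `top_k` are hypothetical.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def top_k(probs, k):
    """Return (index, renormalized weight) pairs for the k largest probabilities."""
    idx = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in idx)
    return [(i, probs[i] / total) for i in idx]


class SubMoE:
    """Token-level MoE: each token is routed independently (local routing)."""

    def __init__(self, dim, n_experts, k, rng):
        self.gate = rand_matrix(n_experts, dim, rng)
        # Assumption: each expert is a single linear map.
        self.experts = [rand_matrix(dim, dim, rng) for _ in range(n_experts)]
        self.k = k

    def __call__(self, tokens):
        out = []
        for t in tokens:
            y = [0.0] * len(t)
            for e, w in top_k(softmax(matvec(self.gate, t)), self.k):
                y = [a + w * b for a, b in zip(y, matvec(self.experts[e], t))]
            out.append(y)
        return out


class NestedMoE:
    """Image-level MoE whose experts are themselves token-level Sub-MoEs."""

    def __init__(self, dim, n_experts, n_sub_experts, k=1, sub_k=1, seed=0):
        rng = random.Random(seed)
        self.gate = rand_matrix(n_experts, dim, rng)
        self.experts = [SubMoE(dim, n_sub_experts, sub_k, rng)
                        for _ in range(n_experts)]
        self.k = k

    def route(self, tokens):
        # Global routing: one decision per sample, from mean-pooled tokens
        # (the pooling choice is an assumption of this sketch).
        pooled = [sum(col) / len(tokens) for col in zip(*tokens)]
        return top_k(softmax(matvec(self.gate, pooled)), self.k)

    def __call__(self, tokens):
        out = [[0.0] * len(tokens[0]) for _ in tokens]
        for e, w in self.route(tokens):
            ye = self.experts[e](tokens)
            out = [[a + w * b for a, b in zip(ro, re)]
                   for ro, re in zip(out, ye)]
        return out


# Usage: 16 tokens of width 8, 4 image-level experts with 3 sub-experts each.
rng = random.Random(1)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
model = NestedMoE(dim=8, n_experts=4, n_sub_experts=3, k=2)
output = model(tokens)
```

In this sketch the image-level gate makes one routing decision for the whole sample, while each selected expert makes a separate decision per token, which mirrors the global/local split the abstract describes.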

Paper Structure (33 sections, 29 equations, 11 figures, 9 tables)

Figures (11)

  • Figure 1: Comparison of two different network architectures. (a) Traditional single-network architecture; (b) our proposed nested MoE architecture, where image-level MoE experts learn global diversity across different PDE types, while token-level Sub-MoE experts capture complex local features within equations.
  • Figure 2: Architecture overview. We train on twelve mixed PDE datasets, predicting the next frame based on the preceding frames. We design a nested MoE architecture: (1) the top shows the overall model architecture; (2) the bottom right illustrates the nested Sub-MoE architecture; and (3) the bottom left depicts the improved Flash Attention architecture.
  • Figure 3: Visualization of 2D high-resolution turbulence prediction results. (1) The first column shows the true values, the second shows the model predictions, and the third shows the corresponding errors. (2) The predicted physical quantities are horizontal velocity, vertical velocity, density field, and pressure field.
  • Figure 4: Performance comparison of different models on the 2D high-resolution turbulence task. We use L2RE as the evaluation metric, where Vanilla denotes training from scratch, and -FT indicates results after 500 fine-tuning epochs on the downstream task.
  • Figure 5: Visualization of spatial activation patterns produced by token-level experts in the Sub-MoE layer. For each input sample, the expert activation probabilities are projected onto heatmaps.
  • ...and 6 more figures
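
The Figure 4 caption reports L2RE. Assuming this denotes the usual relative L2 error, ||pred - true||_2 / ||true||_2, it can be sketched as below. How the paper averages over samples, channels, or time steps is not stated in this excerpt, so the function `l2re` is a hypothetical helper that handles a single flattened field only.

```python
import math


def l2re(pred, true):
    """Relative L2 error between two flattened fields of equal length."""
    if len(pred) != len(true):
        raise ValueError("pred and true must have the same length")
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)))
    den = math.sqrt(sum(t ** 2 for t in true))
    if den == 0.0:
        raise ValueError("reference field has zero norm")
    return num / den


# Usage: a reference of norm 5 and a prediction that is off by 0.5 in one entry.
error = l2re([3.0, 4.5], [3.0, 4.0])
```

A perfect prediction gives 0, and predicting all zeros gives 1, so under this definition values below 1 mean the model beats a trivial zero prediction.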