Mixture-of-Experts Operator Transformer for Large-Scale PDE Pre-Training

Hong Wang, Haiyang Xin, Jie Wang, Xuanze Yang, Fei Zha, Huanshuo Dong, Yan Jiang

TL;DR

The paper tackles data heterogeneity and high inference costs in pre-training neural operators for PDEs by proposing MoE-POT, a sparse Transformer that decouples capacity from computation via layer-wise router gating. Each layer has 16 routed experts and 2 shared experts; only the Top-$K$ routed experts (e.g., $K=4$) are activated, which bounds computation while letting experts specialize across PDE types. Models with 30M to 0.5B total parameters are pre-trained on six PDE datasets. A model with 90M activated parameters achieves up to a 40% reduction in zero-shot error relative to existing models with 120M activated parameters, and the router's decisions alone suffice to classify PDE types with high accuracy. The paper also reports that performance scales with model size and that pre-training yields notable gains on downstream tasks.

Abstract

Pre-training has proven effective in addressing data scarcity and performance limitations in solving PDE problems with neural operators. However, challenges remain due to the heterogeneity of PDE datasets in equation types, which leads to high errors in mixed training. Additionally, dense pre-training models that scale parameters by increasing network width or depth incur significant inference costs. To tackle these challenges, we propose a novel Mixture-of-Experts Pre-training Operator Transformer (MoE-POT), a sparsely activated architecture that scales parameters efficiently while controlling inference costs. Specifically, our model adopts a layer-wise router-gating network to dynamically select 4 routed experts from 16 expert networks during inference, enabling the model to focus on equation-specific features. Meanwhile, we also integrate 2 shared experts, aiming to capture common properties of PDEs and reduce redundancy among routed experts. The final output is computed as the weighted average of the results from all activated experts. We pre-train models with parameters from 30M to 0.5B on 6 public PDE datasets. Our model with 90M activated parameters achieves up to a 40% reduction in zero-shot error compared with existing models with 120M activated parameters. Additionally, we conduct interpretability analysis, showing that dataset types can be inferred from router-gating network decisions, which validates the rationality and effectiveness of the MoE architecture.
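
The routing scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a linear router followed by a softmax, gate weights renormalized over the Top-4 selected experts, shared experts that each receive a fixed weight of 1, and toy experts consisting of a single linear map with a tanh. The names `MoELayer`, `make_expert`, `route`, and `forward` are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def make_expert(rng, dim):
    # Hypothetical toy expert: one linear map followed by tanh.
    W = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
    return lambda x: [math.tanh(v) for v in matvec(W, x)]


class MoELayer:
    """Sketch of one layer: 16 routed experts, 2 shared experts, Top-4 routing."""

    def __init__(self, dim, n_routed=16, n_shared=2, top_k=4, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.routed = [make_expert(rng, dim) for _ in range(n_routed)]
        self.shared = [make_expert(rng, dim) for _ in range(n_shared)]
        # Assumption: the router is a single linear map to one logit per routed expert.
        self.router_W = [[rng.gauss(0.0, 0.1) for _ in range(dim)]
                         for _ in range(n_routed)]

    def route(self, x):
        # Score every routed expert, keep the Top-K, renormalize their gates to sum to 1.
        probs = softmax(matvec(self.router_W, x))
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = order[:self.top_k]
        norm = sum(probs[i] for i in chosen)
        return {i: probs[i] / norm for i in chosen}

    def forward(self, x):
        gates = self.route(x)
        # Assumption: shared experts are always active with weight 1.0 each.
        weighted = [(1.0, expert(x)) for expert in self.shared]
        # Only the selected routed experts are evaluated; the rest cost nothing.
        weighted += [(g, self.routed[i](x)) for i, g in gates.items()]
        total = sum(w for w, _ in weighted)
        # Weighted average of the outputs of all activated experts.
        out = [sum(w * y[d] for w, y in weighted) / total for d in range(len(x))]
        return out, gates


layer = MoELayer(dim=8)
x = [0.1 * i for i in range(8)]
out, gates = layer.forward(x)
print(sorted(gates))  # indices of the 4 routed experts chosen for this input
```

In this sketch only 6 of the 18 experts run per input, which is how activated parameters stay well below total parameters. The gate dictionary returned by `forward` is also the kind of signal the interpretability analysis relies on: inputs from different PDE datasets would be expected to favor different routed experts.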

Paper Structure

This paper contains 40 sections, 19 equations, 4 figures, 17 tables.

Figures (4)

  • Figure 1: Left. An illustration of pre-training a PDE foundation model using extensive data from diverse datasets. The pre-trained model is subsequently fine-tuned for various downstream operator learning tasks, enabling the handling of complex scenarios. Right. (a) Comparison of errors across different numbers of fine-tuning epochs; (b) Comparison of zero-shot errors across different models.
  • Figure 2: Left. The impact of mixed training on multiple datasets using FNO on model performance. Right. Usage ratio of routed experts in different datasets in block 4.
  • Figure 3: An illustration of our model architecture. The process begins by sampling trajectories from mixed datasets of multiple PDEs. The model is optimized by predicting the next frame based on previous frames. The mixture-of-experts layer consists of shared experts, routed experts, and a router-gating network. The router-gating network is responsible for selecting the routed experts, enabling efficient and specialized processing while ensuring scalability and modularity.
  • Figure 4: (a) Comparison of errors across different numbers of fine-tuning epochs; (b) Comparison of zero-shot errors across different models; (c) Dataset classification accuracy based on router-gating network selection.