Scaling Machine Learning Interatomic Potentials with Mixtures of Experts

Yuzhi Liu, Duo Zhang, Anyang Peng, Weinan E, Linfeng Zhang, Han Wang

TL;DR

Sparse activation combined with shared experts yields substantial performance gains, and nonlinear MoE formulations outperform MoLE when shared experts are present, which underscores the importance of nonlinear expert specialization.

Abstract

Machine Learning Interatomic Potentials (MLIPs) enable accurate large-scale atomistic simulations, yet improving their expressive capacity efficiently remains challenging. Here we systematically develop Mixture-of-Experts (MoE) and Mixture-of-Linear-Experts (MoLE) architectures for MLIPs and analyze the effects of routing strategies and expert designs. We show that sparse activation combined with shared experts yields substantial performance gains, and that nonlinear MoE formulations outperform MoLE when shared experts are present, underscoring the importance of nonlinear expert specialization. Furthermore, element-wise routing consistently surpasses configuration-level routing, while global MoE routing often leads to numerical instability. The resulting element-wise MoE model achieves state-of-the-art accuracy across the OMol25, OMat24, and OC20M benchmarks. Analysis of routing patterns reveals chemically interpretable expert specialization aligned with periodic-table trends, indicating that the model effectively captures element-specific chemical characteristics for precise interatomic modeling.
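
The combination of sparse activation, shared experts, and element-wise routing described above can be illustrated with a minimal sketch. It is not the authors' implementation. The per-element logit table `router_logits`, the softmax renormalization over the selected experts, the `tanh` experts, and all function names are assumptions made for illustration only; the counts (3 routed experts active, 1 shared expert) follow the Figure 1 caption.

```python
import math
import random

def make_expert(dim, rng):
    """Build a tiny nonlinear expert y = tanh(W x) (hypothetical stand-in for an MLP)."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
    def expert(x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]
    return expert

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_by_element(logits, k):
    """Pick the k highest-scoring experts and renormalize their gates (assumed scheme)."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    gates = softmax([logits[i] for i in top])
    return list(zip(top, gates))

def moe_e_forward(x, element, router_logits, routed, shared, k):
    """Shared experts always run; k routed experts are chosen from the element's logits."""
    out = [0.0] * len(x)
    for expert in shared:
        out = [o + y for o, y in zip(out, expert(x))]
    for idx, gate in route_by_element(router_logits[element], k):
        out = [o + gate * y for o, y in zip(out, routed[idx](x))]
    return out

rng = random.Random(0)
dim, n_routed, n_shared, k = 4, 8, 1, 3
routed = [make_expert(dim, rng) for _ in range(n_routed)]
shared = [make_expert(dim, rng) for _ in range(n_shared)]
# Assumption: one learnable logit vector per chemical element.
router_logits = {z: [rng.gauss(0, 1) for _ in range(n_routed)] for z in ("H", "C", "O")}

x = [0.1, -0.2, 0.3, 0.4]  # hypothetical per-atom feature vector
y = moe_e_forward(x, "O", router_logits, routed, shared, k)
selection = route_by_element(router_logits["O"], k)
```

In this sketch every atom of the same element activates the same routed experts, so only `k` of the `n_routed` experts are evaluated per atom while the shared expert contributes to all of them.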

Paper Structure

This paper contains 10 sections, 8 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Schematic illustration of the model architectures. (a) The standard MLP framework. (b) The MoE-E framework, featuring a router that dynamically selects 3 active experts from the pool, complemented by 1 fixed shared expert to capture universal features. (c) The MoLE-E framework, where 4 expert weights are linearly combined via the router to modulate the main network weights, including 1 shared expert component.
  • Figure 2: Performance benchmarks on the OMol25 dataset. (a) Impact of shared expert allocation. Normalized energy and force MAE for MoE-E are shown as a function of the shared expert ratio, using a total pool of 64 experts. The x-axis denotes the proportion of shared experts (e.g., $(2/6)$ indicates 2 shared experts among 6 total activated experts). Background shading demarcates different $K$ activation settings for MoE-E. The horizontal dashed line marks the baseline performance of a standard 6-layer DPA3 model. (b) Scaling with total expert count. Normalized energy (left) and force (right) MAE are shown as a function of the total number of experts. Two baselines are provided: the standard 6-layer DPA3 model and a "$4\times$ Params" variant obtained by doubling the hidden dimension of the 6-layer DPA3 model.
  • Figure 3: Performance benchmarks for normalized energy (left) and force (right) predictions across three datasets: OMol25, OMat24, and OC20M. The baseline corresponds to the standard DPA3 model, while the "6$\times$ Params" variant is scaled to six times the total parameters of the baseline via wider hidden layers. The MoLE-E model employs 64 experts with 3 shared experts, and the MoE-E model uses 64 experts with 3 shared experts out of $K=6$ activated experts.
  • Figure 4: Expert weighting distribution analysis via PCA for various elemental groups. (a) Overview of the general distribution for all elements. (b) Distribution patterns for alkali metals, alkaline earth metals, and noble gases. (c) Regional distribution of the p-block groups excluding the noble gases, specifically the boron group (Group 13) through the halogens (Group 17). (d) Spatial clustering of transition metal elements.
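
The MoLE-E scheme in the Figure 1(c) caption, where expert weights are linearly combined by the router, can be sketched as follows. This is a hedged illustration, not the authors' code: the additive form `W_eff = W_shared + sum_i a_i * W_i`, the fixed coefficients `coeffs`, and all names are assumptions. The sketch checks one generic property of linear experts: merging the weights first gives the same result as running each expert and mixing the outputs, so a MoLE layer reduces to a single effective linear map per routing decision.

```python
import random

def matvec(W, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mix_weights(shared_W, expert_Ws, coeffs):
    """W_eff = W_shared + sum_i coeffs[i] * W_i, computed entry by entry (assumed form)."""
    rows, cols = len(shared_W), len(shared_W[0])
    return [
        [shared_W[r][c] + sum(a * W[r][c] for a, W in zip(coeffs, expert_Ws))
         for c in range(cols)]
        for r in range(rows)
    ]

rng = random.Random(0)
dim = 4

def rand_matrix():
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

# 4 expert weight matrices in total, 1 of them shared, as in the Figure 1(c) caption.
shared_W = rand_matrix()
expert_Ws = [rand_matrix() for _ in range(3)]
coeffs = [0.5, 0.3, 0.2]   # hypothetical router output for one element
x = [0.1, -0.2, 0.3, 0.4]  # hypothetical per-atom feature vector

# Route 1: merge the weights first, then apply one matrix-vector product.
merged = matvec(mix_weights(shared_W, expert_Ws, coeffs), x)

# Route 2: evaluate every linear expert separately, then mix the outputs.
separate = matvec(shared_W, x)
for a, W in zip(coeffs, expert_Ws):
    separate = [s + a * y for s, y in zip(separate, matvec(W, x))]

max_diff = max(abs(m - s) for m, s in zip(merged, separate))
```

The two routes agree up to floating-point rounding. With nonlinear experts, as in MoE-E, this equivalence no longer holds, which is one way to read the reported gap between MoE and MoLE when shared experts are present.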