UMA: A Family of Universal Models for Atoms

Brandon M. Wood, Misko Dzamba, Xiang Fu, Meng Gao, Muhammed Shuaibi, Luis Barroso-Luque, Kareem Abdelmaqsoud, Vahe Gharakhanyan, John R. Kitchin, Daniel S. Levine, Kyle Michel, Anuroop Sriram, Taco Cohen, Abhishek Das, Ammar Rizvi, Sushree Jagriti Sahoo, Zachary W. Ulissi, C. Lawrence Zitnick

TL;DR

The work introduces UMA, a family of universal interatomic potentials trained on ~500 million 3D atomic structures spanning materials, molecules, and catalysts to achieve high speed and accuracy across domains. A Mixture of Linear Experts (MoLE) within an eSEN-based equivariant GNN architecture scales model capacity without increasing inference cost, enabling large models with efficient, consistent performance. The authors establish empirical scaling laws relating compute, data, and model size, show strong zero-shot generalization across diverse tasks, and demonstrate state-of-the-art results on benchmarks like Matbench Discovery and AdsorbML, while enabling energy-conserving MD for practical simulations. They release code, weights, and data to the community to accelerate development of universal MLIPs across chemistry and materials science.

Abstract

The ability to quickly and accurately compute properties from atomic simulations is critical for advancing a large number of applications in chemistry and materials science including drug discovery, energy storage, and semiconductor manufacturing. To address this need, Meta FAIR presents a family of Universal Models for Atoms (UMA), designed to push the frontier of speed, accuracy, and generalization. UMA models are trained on half a billion unique 3D atomic structures (the largest training runs to date) by compiling data across multiple chemical domains, e.g. molecules, materials, and catalysts. We develop empirical scaling laws to help understand how to increase model capacity alongside dataset size to achieve the best accuracy. The UMA small and medium models utilize a novel architectural design we refer to as mixture of linear experts that enables increasing model capacity without sacrificing speed. For example, UMA-medium has 1.4B parameters but only ~50M active parameters per atomic structure. We evaluate UMA models on a diverse set of applications across multiple domains and find that, remarkably, a single model without any fine-tuning can perform similarly or better than specialized models. We are releasing the UMA code, weights, and associated data to accelerate computational workflows and enable the community to continue to build increasingly capable AI models.
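
The gap between total and active parameters follows from the linearity of the experts: if the mixing coefficients are fixed for a whole atomic structure (the Figure 2 caption states that routing depends only on global information), the experts can be summed into a single weight matrix once, before the forward pass. The NumPy sketch below illustrates that identity only. The layer sizes, the `router` matrix, the `global_embedding` vector, and the softmax normalization are assumptions made for illustration and are not taken from the released UMA code.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def merge_experts(expert_weights, alpha):
    """Collapse K expert matrices (K, d_out, d_in) into one (d_out, d_in) matrix."""
    return np.tensordot(alpha, expert_weights, axes=1)

def mole_unmerged(expert_weights, alpha, x):
    """Reference path: apply every expert to x (n_atoms, d_in), then mix the outputs."""
    outputs = np.einsum("koi,ni->kno", expert_weights, x)  # (K, n_atoms, d_out)
    return np.tensordot(alpha, outputs, axes=1)            # (n_atoms, d_out)

rng = np.random.default_rng(0)
K, d_in, d_out, n_atoms = 8, 16, 32, 5  # hypothetical sizes

expert_weights = rng.normal(size=(K, d_out, d_in))

# Hypothetical system-level features and routing layer (assumptions).
global_embedding = rng.normal(size=4)
router = rng.normal(size=(K, 4))
alpha = softmax(router @ global_embedding)  # one coefficient per expert

x = rng.normal(size=(n_atoms, d_in))

# Merge once per structure; every later forward pass uses a single matrix.
W = merge_experts(expert_weights, alpha)
y_merged = x @ W.T

# The merged layer matches the explicit mixture of expert outputs.
y_reference = mole_unmerged(expert_weights, alpha, x)
```

In this sketch the stored parameters grow with `K`, while the per-structure cost after merging is that of one `d_out × d_in` linear layer. That is the sense in which capacity can grow without slowing repeated evaluations on the same structure.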

Paper Structure

This paper contains 44 sections, 7 equations, 5 figures, 22 tables.

Figures (5)

  • Figure 1: Visualization of the different datasets used for training. The 2D plots (bottom) illustrate the number of pairwise interactions contained in each dataset for every combination of elements. Note their combination covers nearly the entire chemical space with the exception of the radioactive elements. Model accuracies have improved with training dataset size (upper right), and this paper explores the limits of this scaling.
  • Figure 2: (left) Overview of model architecture. The SO2 convolution is made up of a set of linear operations, and each of these operations is replaced with MoLE. (middle) Illustration of MoE and MoLE. The embedding used for routing, which estimates the expert weights $\alpha$, depends only on global information, making it possible to merge the experts before the model forward pass (middle, bottom), which has substantial benefits for applications that require long rollouts such as molecular dynamics. (right) Bar plot of UMA-S trained with MoLE for multi-task and without MoLE for single-task and multi-task. Note the MoLE model outperforms non-MoLE models. (right, bottom) Loss plots when varying the number of experts from 1 to 128 for UMA-S.
  • Figure 3: Empirical scaling measurements of dense (blue) vs. MoLE (red) model architectures. FLOPs vs. validation loss for (a) dense and (b) MoLE (8-expert) models. Experiment sets are performed by holding FLOPs constant and varying model size and training data. Diamonds represent the compute optimal frontier. (c) Training compute vs. parameters. The vertical green dotted line represents our estimated training budget of $O(10^{22})$ FLOPs; the horizontal dotted lines are the corresponding dense and MoLE model sizes. (d) Compute vs. dataset size (atoms), with the green dotted line representing the training FLOPs required for 1 epoch of training data (50B atoms). (e) Overlay of dense vs. MoLE compute optimal frontiers from (a-b) and the fitted power law of validation loss as a function of parameters. A compute optimal MoLE model with $\Delta \approx 2.5\times$ fewer active parameters can achieve an equivalent loss. Fitting details and parameters are described in Appendix \ref{sec:SI-scaling}.
  • Figure 4: Pre-training curves of UMA-L for both single-task and multi-task models. Errors are normalized based on single-task performance. Note single-task models can overfit (forces on right), and the multi-task model generally converges to lower errors.
  • Figure 5: Log mean expert coefficient across element-expert pairs.
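
The Figure 3 caption refers to a fitted power law of validation loss as a function of parameters. As a rough illustration of what such a fit involves, the sketch below fits $y = a\,x^{b}$ by ordinary least squares in log-log space. The functional form without an offset term, the helper names, and the data points are all assumptions: the points are synthetic values generated from a made-up law, not measurements from the paper, whose actual fitting procedure is described in its appendix.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx                        # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)    # intercept in log-log space = log(a)
    return a, b

def params_for_loss(a, b, target_loss):
    """Invert y = a * x**b to get the x that reaches target_loss."""
    return (target_loss / a) ** (1.0 / b)

# Synthetic points from a made-up law (NOT values from the paper).
true_a, true_b = 3.0, -0.25
params = [1e6, 1e7, 1e8, 1e9]
losses = [true_a * p ** true_b for p in params]

a, b = fit_power_law(params, losses)
needed = params_for_loss(a, b, losses[2])  # recovers roughly 1e8
```

Fitting one such curve per architecture and comparing `params_for_loss` at the same target loss is one way to express a parameter ratio like the $\Delta$ quoted in the caption, under the assumption that both frontiers are well described by a pure power law.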