Migration as a Probe: A Generalizable Benchmark Framework for Specialist vs. Generalist Machine-Learned Force Fields

Yi Cao, Paulette Clancy

TL;DR

This work introduces migration-based probes as a general, physics-informed benchmark to compare specialist MLFFs trained from scratch against foundation-model fine-tuning for Cr-doped Sb$_2$Te$_3$. It demonstrates that fine-tuning sharply improves kinetic predictions but can degrade long-range physics, while foundation models offer robust extrapolation yet may require system-specific sharpening. Latent-space analyses reveal fundamentally different encodings across training strategies, explaining why models diverge on non-equilibrium pathways. The framework guides data-efficient active learning and stresses the importance of evaluating dynamic properties alongside equilibrium metrics for reliable MLFF deployment.

Abstract

Machine-learned force fields (MLFFs), especially pre-trained foundation models, are transforming computational materials science by enabling ab initio-level accuracy at molecular dynamics scales. Yet their rapid rise raises a key question: should researchers train specialist models from scratch, fine-tune generalist foundation models, or use hybrid approaches? The trade-offs in data efficiency, accuracy, cost, and robustness to out-of-distribution failure remain unclear. We introduce a benchmarking framework using defect migration pathways, evaluated through nudged elastic band trajectories, as diagnostic probes that test both interpolation and extrapolation. Using Cr-doped Sb$_2$Te$_3$ as a representative two-dimensional material, we benchmark multiple training paradigms within the MACE architecture across equilibrium, kinetic (atomic migration), and mechanical (interlayer sliding) tasks. Fine-tuned models substantially outperform from-scratch and zero-shot approaches for kinetic properties but show partial loss of long-range physics. Representational analysis reveals distinct, non-overlapping latent encodings, indicating that different training strategies learn different aspects of system physics. This framework provides practical guidelines for MLFF development and establishes migration-based probes as efficient diagnostics linking performance to learned representations, guiding future uncertainty-aware active learning.

Paper Structure

This paper contains 46 sections, 7 equations, 8 figures, 1 table, 1 algorithm.

Figures (8)

  • Figure 1: Conceptual Diagram. A flowchart illustrating two competing workflows: (a) training a specialist MLFF from scratch, and (b) fine-tuning a generalist foundation model. The diagram highlights the benchmarking questions this paper addresses.
  • Figure 2: Comprehensive Benchmarking of MLFF Performance in Molecular Dynamics Simulations. (a) A schematic of the unified evaluation protocol. Four candidate models—a zero-shot foundation model, a bespoke model trained from scratch, and two fine-tuned variants—are each used to drive a 200 ps MD simulation. The resulting trajectories are then subjected to a uniform set of post-processing analyses to evaluate key physical properties. (b) Thermodynamic stability, demonstrated by the evolution of temperature and pressure over the simulation, which remain stable around their target values for all models. (c) Thermal transport properties, showing the Heat Flux Autocorrelation Function (HFACF) and its running integral to compute thermal conductivity ($\kappa$). (d) The Mean Squared Displacement (MSD), used to assess atomic mobility and calculate the diffusion coefficient. (e) The Velocity Autocorrelation Function (VACF), which describes the system's underlying dynamics.
  • Figure 3: Comparative analysis of MLFF training strategies for predicting atomic migration barriers. (a) Schematic illustration of the simulated Cr atom migration pathway between two stable sites within the Sb$_2$Te$_3$ bilayer. (b) Bar plot showing the migration energy prediction errors for various models, spanning bespoke training to advanced active learning approaches. (c--f) Comparison of the minimum energy pathway (MEP) profiles for the Cr migration process. The solid black line denotes the ground-truth DFT reference, while dashed lines represent predictions from MACE models trained with different strategies.
  • Figure 4: Evaluation of MACE models on collective lattice displacement: interlayer sliding in Sb$_2$Te$_3$. (a) Schematic illustration of the bilayer sliding process in Sb$_2$Te$_3$, where the top layer (yellow/orange atoms) slides relative to the bottom layer along the crystallographic direction. The purple sphere indicates the position of a Cr dopant when present. (b) Energy barriers for interlayer sliding in pristine Sb$_2$Te$_3$ as calculated by DFT (black, ground truth) and various MACE models. The reaction coordinate represents the normalized sliding distance from the initial to final configuration. (c) Energy barriers for the same sliding process in Cr-doped Sb$_2$Te$_3$.
  • Figure 5: Representation Analysis Reveals How Fine-Tuning Aligns a Physical Manifold to Enable Generalization. Projections of atomic environment descriptors from 600 K MD trajectories. (a) t-SNE projection shows that Foundation (red) and Scratch (orange) models produce highly separated representations, while fine-tuned models (blue) are intermediate. (b) PHATE projection reveals the system's continuous dynamical manifold. Note the clear separation between the brittle scratch model representation and the regularized fine-tuned models along the vertical axis (PHATE 2). (c) The same PHATE embedding colored by potential energy confirms that the manifold's geometry corresponds to the physical energy landscape. The foundation model's large energy offset indicates its nature as an uncalibrated prior. (d) Average silhouette scores (from t-SNE) quantify the representational dissimilarity. The results demonstrate that fine-tuning succeeds by constraining a generalist representation to the specific physical manifold of the target system, a process that regularizes the model and enables accurate prediction of complex dynamics like diffusion.
  • ...and 3 more figures
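
As an illustration of the migration-barrier comparison described in the Figure 3 caption, the following is a minimal sketch of how a predicted minimum energy pathway could be scored against a DFT reference. It assumes both profiles are sampled at the same images along the path. The function names, the choice of forward barrier and profile RMSE as metrics, and the energy values are all hypothetical; the paper's own error definition may differ.

```python
from math import sqrt


def forward_barrier(energies):
    """Forward barrier: highest image energy relative to the initial image."""
    return max(energies) - energies[0]


def profile_rmse(model, reference):
    """RMSE between two MEP profiles, each shifted so its first image is zero."""
    if len(model) != len(reference):
        raise ValueError("profiles must have the same number of images")
    diffs = [(m - model[0]) - (r - reference[0]) for m, r in zip(model, reference)]
    return sqrt(sum(d * d for d in diffs) / len(diffs))


# Hypothetical energies in eV along a 5-image path (illustrative, not from the paper).
dft_profile = [0.00, 0.30, 0.80, 0.35, 0.05]
# A constant energy offset in the model drops out once each profile is referenced
# to its own initial image.
mlff_profile = [1.00, 1.25, 1.60, 1.30, 1.05]

barrier_error = forward_barrier(mlff_profile) - forward_barrier(dft_profile)
rmse = profile_rmse(mlff_profile, dft_profile)
print(f"barrier error: {barrier_error:+.3f} eV, profile RMSE: {rmse:.3f} eV")
```

Referencing each profile to its own initial image is one possible design choice: it separates errors in the shape of the pathway from a uniform energy offset, such as the offset noted for the foundation model in the Figure 5 caption.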