
Dataset Distillation for Machine Learning Force Field in Phase Transition Regime

Ruiyang Chen, Qingyuan Zhang, Ji Chen

Abstract

Machine learning force fields (MLFFs) have emerged as powerful data-driven tools for atomistic simulations, enabling large-scale and complex atomic systems to be simulated with accuracy comparable to \textit{ab initio} methods. However, MLFFs often suffer from low training efficiency in the phase transition regime, where structural fluctuations are significantly elevated. To address this challenge, we propose a Central-Peripheral Distillation (CPD) algorithm for training dataset distillation. By strategically integrating representative samples with critical corner cases, the CPD algorithm ensures that the distilled dataset retains maximum structural diversity. We validated the efficacy of the CPD method on the liquid-liquid phase transition of dense hydrogen. Results show that, with the CPD approach, only 200 configurations are sufficient to train an MLFF that can fully reproduce the structural and dynamical properties of liquid hydrogen in the vicinity of its phase transition regime. This work paves the way for high-fidelity labeling of MLFF training datasets, for instance by adopting high-level \textit{ab initio} calculations beyond standard density functional theory, thereby enhancing the predictive accuracy of MLFFs.

Paper Structure

This paper contains 11 sections, 1 equation, and 6 figures.

Figures (6)

  • Figure 1: Schematic illustration of the CPD sampling workflow. The CPD algorithm extracts molecular features via MACE and PCA, followed by an optimized local density analysis. By employing a dual-focus weighted sampling strategy that targets the top 20% densest (central) and 20% sparsest (peripheral) points, the model captures both representative phase characteristics and critical rare configurations, maximizing the structural diversity of the distilled dataset.
  • Figure 2: Structural characterization and phase transition of the hydrogen test dataset for LLPT at 1000 K. (a) Histogram of the density distribution for the 575 configurations, covering a range from 0.98 to 1.41 $\mathrm{g/cm^3}$. (b) Radial distribution functions (RDFs) for the molecular ($\rho = 0.98\,\mathrm{g/cm^3}$), transition ($\rho = 1.15\,\mathrm{g/cm^3}$), and atomic ($\rho = 1.38\,\mathrm{g/cm^3}$) regimes, illustrating the structural evolution during the phase transition. (c--e) Representative snapshots at different densities, where hydrogen atoms in the atomic and molecular phases are colored light gray and light blue, respectively: (c) molecular phase, (d) transition region, and (e) atomic phase.
  • Figure 3: Comparison of energy and force prediction performance. The RMSE of energy (a) and force (b) are plotted as a function of the number of training data selected using different data distillation methods. The dashed line indicates the error achieved with the full training dataset before distillation.
  • Figure 4: Performance of MLFFs trained on different datasets for hydrogen LLPT at 1000 K. Pressure (a) and molecular fraction (b) as a function of density, labeled by the Wigner-Seitz radius. The molecular fraction is defined following ref. cheng2020evidence, and a detailed calculation description is provided in SI.S3. The inset in (a) is a zoom-in of the phase transition regime. Models trained on DIRECT-distilled and randomly sampled datasets failed to yield reliable results for the first ten and five data points of the atomic phase, respectively. Consequently, these points are omitted from the figures.
  • Figure S1: Evolution of training and validation losses over 500 epochs for machine learning force fields trained on various datasets. The subpanels illustrate the convergence behavior for: (a) the full reference dataset comprising 575 configurations; (b) the CPD-distilled dataset (200 configurations); (c) a baseline dataset comprising 200 randomly sampled configurations; (d) the DIRECT-distilled dataset (200 configurations); and (e) the RND-distilled dataset (200 configurations). All models demonstrate stable convergence within the allocated training epochs.
  • ...and 1 more figure
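The workflow in Figure 1 (descriptor extraction, PCA, local density analysis, then weighted sampling of the 20% densest and 20% sparsest points) can be sketched as below. This is a minimal illustration, not the authors' implementation: `features` stands in for per-configuration MACE descriptors, and the PCA dimensionality, neighbor count `k`, and the inverse-kNN-distance density estimate are all assumed choices.

```python
import numpy as np

def cpd_select(features, n_select, central_frac=0.2, peripheral_frac=0.2,
               n_components=2, k=10, seed=None):
    """Sketch of Central-Peripheral Distillation (CPD) sampling.

    features : (n_configs, n_features) array of per-configuration
               descriptors (the paper uses MACE features; here any
               numeric matrix works).
    Returns sorted indices of the selected configurations.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(features, dtype=float)

    # PCA via SVD on the centered feature matrix.
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ vt[:n_components].T

    # Local density estimate: inverse of the mean distance to the
    # k nearest neighbors in PCA space (an assumed stand-in for the
    # paper's "optimized local density analysis").
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    density = 1.0 / np.sort(d, axis=1)[:, :k].mean(axis=1)

    # Dual-focus pools: densest 20% (central, representative samples)
    # and sparsest 20% (peripheral, rare corner cases).
    order = np.argsort(density)                       # sparsest first
    n = len(X)
    peripheral = order[: int(peripheral_frac * n)]
    central = order[-int(central_frac * n):]
    pool = np.concatenate([central, peripheral])

    # Draw the training budget from the combined pool without replacement.
    chosen = rng.choice(pool, size=min(n_select, len(pool)), replace=False)
    return np.sort(chosen)
```

For the dataset sizes in the paper (575 reference configurations, 200 selected), the two 20% pools together hold 230 candidates, from which the 200-configuration training set is drawn.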