Cutting Through the Noise: On-the-fly Outlier Detection for Robust Training of Machine Learning Interatomic Potentials

Terry C. W. Lam, Niamh O'Neill, Christoph Schran, Lars L. Schaaf

TL;DR

This paper introduces an on-the-fly outlier detection scheme that tracks the loss distribution via an exponential moving average and automatically down-weights noisy samples, without requiring additional reference calculations. The approach prevents overfitting and matches the performance of iterative refinement baselines with significantly reduced overhead.

Abstract

The accuracy of machine learning interatomic potentials suffers when the reference data contains numerical noise. This noise often originates from unconverged or inconsistent electronic-structure calculations and is challenging to identify. Existing mitigation strategies, such as manual filtering or iterative refinement of outliers, require either substantial expert effort or multiple expensive retraining cycles, making them difficult to scale to large datasets. Here, we introduce an on-the-fly outlier detection scheme that automatically down-weights noisy samples, without requiring additional reference calculations. By tracking the loss distribution via an exponential moving average, this unsupervised method identifies outliers throughout a single training run. We show that this approach prevents overfitting and matches the performance of iterative refinement baselines with significantly reduced overhead. The method's effectiveness is demonstrated by recovering accurate physical observables, including diffusion coefficients, for liquid water from unconverged reference data. Furthermore, we validate its scalability by training a foundation model for organic chemistry on the SPICE dataset, where it reduces energy errors by a factor of three. This framework provides a simple, automated solution for training robust models on imperfect datasets of any size.
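The idea of tracking the loss distribution with an exponential moving average (EMA) and down-weighting outliers can be illustrated with a minimal sketch. The abstract does not specify the statistics or the weighting function, so everything concrete here is an assumption made for illustration: the class name `EmaOutlierWeights`, the use of the median and median absolute deviation of the log-loss, the `decay` and `z_threshold` parameters, and the exponential down-weighting rule. It is not the authors' implementation.

```python
import numpy as np


class EmaOutlierWeights:
    """Hypothetical helper: EMA-tracked loss statistics -> per-sample weights."""

    def __init__(self, decay=0.99, z_threshold=3.0):
        self.decay = decay              # EMA decay factor (assumed value)
        self.z_threshold = z_threshold  # robust z-score above which weights shrink
        self.center = None              # EMA of the median log-loss
        self.scale = None               # EMA of the robust spread of the log-loss

    def update(self, losses):
        """Update the EMA statistics with one batch and return weights in (0, 1]."""
        log_l = np.log(np.asarray(losses, dtype=float) + 1e-12)
        # Median and MAD are used so that outliers barely affect the statistics.
        center = np.median(log_l)
        scale = 1.4826 * np.median(np.abs(log_l - center))
        if self.center is None:
            # First batch: initialise the running statistics directly.
            self.center, self.scale = center, scale
        else:
            d = self.decay
            self.center = d * self.center + (1.0 - d) * center
            self.scale = d * self.scale + (1.0 - d) * scale
        z = (log_l - self.center) / max(self.scale, 1e-12)
        # One-sided rule (assumed): only unusually LARGE losses are down-weighted.
        excess = np.clip(z - self.z_threshold, 0.0, None)
        return np.exp(-excess)


# Toy data: 95 "clean" per-sample losses near 1e-2 and 5 "noisy" losses at 10.
clean = np.linspace(0.8e-2, 1.2e-2, 95)
noisy = np.full(5, 10.0)
losses = np.concatenate([clean, noisy])

tracker = EmaOutlierWeights()
for _ in range(3):  # stands in for successive training steps
    weights = tracker.update(losses)

# Weighted mean loss: the quantity a training step would minimise.
weighted_loss = float(np.sum(weights * losses) / np.sum(weights))
```

In this toy setting the clean samples keep a weight of one, while the noisy samples contribute almost nothing to `weighted_loss`. In an actual training loop the per-sample losses would change every step, which is where the EMA smoothing matters.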

Paper Structure (16 sections, 6 equations, 8 figures)

This paper contains 16 sections, 6 equations, 8 figures.

Figures (8)

  • Figure 1: On-the-fly outlier detection. (a) A high-level overview of the steps for detecting outliers in the training data and down-weighting their impact on the training loss. (b) An overview of the changes in the training procedure, showing normal MLIP training steps (top) and the new on-the-fly bootstrapping technique (bottom). Number labels indicate which parts of the training modifications correspond to the high-level overview in (a).
  • Figure 2: Noise-resilient training on a revMD17 dataset. (a) Log-log plots of the error evolution with epoch over the 2000-epoch training. Each line represents one configuration and is coloured according to whether its labels were from revMD17 ('clean', green) or from MD17 ('noisy', red). (b) Force error distribution at early (${\sim}10^0$), middle (${\sim}10^1$) and late epochs (${\sim}10^2$). The assignment of the bootstrapping weights based on the force error is shown.
  • Figure 3: Bootstrapping prevents overfitting. Force error curves of the noisy samples (red), clean samples (green), and samples in the validation set (blue), as a function of epoch, without and with bootstrapping applied. The thick line represents the median and the shaded area represents the inter-quartile range (IQR). The training error is defined as the RMSE with respect to the reference (potentially false) forces that the training sees. The noisy training data is re-evaluated with a more converged reference method. Comparisons between the model's predictions and the unseen re-evaluated noisy configurations (orange) indicate that the noise-resilient model is able to predict the true forces at validation accuracy.
  • Figure 4: Noise-resilient training compared to iterative refinement. (a) Schematic of iterative refinement, which involves re-evaluating the weights and then repeating the training for multiple cycles. The weight calculation is the same as in bootstrapping, without the need for an EMA. (b) The first model (step 0) is trained either with (green, 'bootstrapped') or without (red, 'vanilla') bootstrapping. Distillation is done 4 times. The force error with error bars (median and IQR) at the end of the 2000 epochs is plotted. Bootstrapping reaches the minimum force error ($\sim30\text{ meV/Å}$) without requiring refinement, while at least 2 refinement steps are required for a normal MACE model to reach this accuracy. A one-step refinement is also performed by stopping the initial training early at the 240th epoch (blue).
  • Figure 5: Effect on observables of bulk water simulations. (a) (left) Histogram of the force RMSE produced by DFT with a loose convergence threshold and (right) the distribution of the forces and the residuals. (b) The self-diffusion coefficient of water at 298 K and 1 atm obtained from classical MD simulations under NVT conditions, for a model trained on the noisy dataset (red) and a model trained on the noisy dataset with bootstrapping (green). The reference value computed from a clean dataset is indicated by a dotted line. (c) Radial distribution functions for (left) O-O, (middle) O-H, and (right) H-H interatomic distances in bulk liquid water at 298 K and 1 atm, taken from a classical MD simulation under NVT conditions. The black dotted line shows the reference. Residuals are shown at the bottom.
  • ...and 3 more figures