Model selection in atomistic simulation

Jonathan E. Moussa

TL;DR

The paper addresses how to fairly compare atomistic simulation methods that differ greatly in cost and accuracy by framing model selection as a probabilistic optimization problem. It develops a framework that combines maximum likelihood estimation, information criteria (AIC/TIC), and cost penalties to balance fit quality, parameter complexity, and computational budgets. In a hydrogen-cluster case study, it fits semiempirical corrections in the form of pair potentials, shows how TIC avoids overfitting, and reveals a Pareto front of cost versus accuracy that guides method design. The discussion extends to transferability, the reliability of data generation, and the practical implications for data-driven method development in quantum chemistry.

Abstract

There are many atomistic simulation methods with very different costs, accuracies, transferabilities, and numbers of empirical parameters. I show how statistical model selection can compare these methods fairly, even when they are very different. These comparisons are also useful for developing new methods that balance cost and accuracy. As an example, I build a semiempirical model for hydrogen clusters.
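
As a rough illustration of how an information criterion trades fit quality against parameter count, the following minimal sketch scores two hypothetical models with the standard Akaike information criterion under a Gaussian error model. The residuals, model names, and parameter counts are made up for illustration and are not taken from the paper; the paper's own success measures, its TIC variant, and its cost penalties are not reproduced here.

```python
import math


def gaussian_max_log_likelihood(errors):
    """Maximized log-likelihood of i.i.d. Gaussian errors.

    The mean and variance are set to their maximum likelihood estimates,
    so the quadratic term collapses to n/2.
    """
    n = len(errors)
    mean = sum(errors) / n
    var = sum((e - mean) ** 2 for e in errors) / n  # MLE uses 1/n, not 1/(n-1)
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)


def aic(errors, num_params):
    """Akaike information criterion, 2k - 2 ln(L_max); lower is better."""
    return 2.0 * num_params - 2.0 * gaussian_max_log_likelihood(errors)


# Hypothetical residuals in kcal/mol (illustrative numbers only).
# num_params counts every fitted parameter, including the two Gaussian ones.
candidates = {
    "simple": {"errors": [-1.0, 1.0, -1.0, 1.0], "num_params": 2},
    "flexible": {"errors": [-0.9, 0.9, -0.9, 0.9], "num_params": 6},
}

scores = {
    name: aic(model["errors"], model["num_params"])
    for name, model in candidates.items()
}
best = min(scores, key=scores.get)
print(scores, best)
```

In this toy comparison the flexible model has slightly smaller residuals, but the gain in log-likelihood does not offset the penalty for its four extra parameters, so the simpler model is preferred.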

Paper Structure

This paper contains 18 sections, 42 equations, 9 figures, and 1 table.

Figures (9)

  • Figure 1: Histograms of interatomic distances between hydrogen atoms in the structures from three reference data sets.
  • Figure 2: Error histograms in kcal/mol for all models and tasks along with their means, standard deviations, and moment-matching Gaussian model fits.
  • Figure 3: Error histograms in kcal/mol for DFT models and all tasks along with the means, standard deviations, and moment-matching Gaussian model fits of the marked data with consistent total spin values between HF and DFT.
  • Figure 4: Error histograms in kcal/mol for SQM models and all tasks along with the means, standard deviations, and moment-matching Gaussian model fits of the marked data from structures with minimum interatomic distances greater than 0.74 Å.
  • Figure 5: Change in penalized success measures $\hat{D}_{\mathrm{6g}} + \Delta$ as the polynomial degree of the pair potential is increased, relative to having no pair potential.
  • ...and 4 more figures