Model selection in atomistic simulation
Jonathan E. Moussa
TL;DR
The paper addresses how to fairly compare atomistic simulation methods that differ greatly in cost and accuracy by framing model selection as a probabilistic optimization problem. It develops a framework that combines maximum likelihood estimation, information criteria (AIC/TIC), and cost penalties to balance fit quality, model complexity, and computational budget. In a hydrogen-cluster case study, it fits semiempirical corrections via pair potentials, shows how TIC avoids overfitting, and reveals a Pareto front of cost versus accuracy that guides method design. The discussion extends to transferability, the reliability of data generation, and the practical implications for data-driven method development in quantum chemistry.
Abstract
There are many atomistic simulation methods with very different costs, accuracies, transferabilities, and numbers of empirical parameters. I show how statistical model selection can compare these methods fairly, even when they are very different. These comparisons are also useful for developing new methods that balance cost and accuracy. As an example, I build a semiempirical model for hydrogen clusters.
