Learning From Simulators: A Theory of Simulation-Grounded Learning
Carson Dudley, Marisa Eisenberg
TL;DR
This work develops a formal theory of Simulation-Grounded Neural Networks (SGNNs), which learn predictive mappings entirely from synthetic data produced by mechanistic simulators. It treats SGNNs as amortized Bayesian predictors under a simulator-induced prior and proves convergence to the Bayes-optimal predictor under the synthetic distribution, with a generalization bound that separates learning performance from simulator–reality mismatch. The authors introduce back-to-simulation attribution for mechanistic interpretability and prove consistency of the attribution with the posterior over latent simulator parameters. They further show that SGNNs can learn unobservable scientific quantities under identifiability and validate the theory with controlled experiments that demonstrate learning of latent parameters, robustness to misspecification, and superior model selection between mechanistic structures. Collectively, the results provide a principled, practical foundation for prediction, interpretation, and discovery in data-limited scientific domains.
Abstract
Simulation-Grounded Neural Networks (SGNNs) are predictive models trained entirely on synthetic data from mechanistic simulations. They have achieved state-of-the-art performance in domains where real-world labels are limited or unobserved, but lack a formal underpinning. We place SGNNs in a unified statistical framework. Under standard loss functions, they can be interpreted as amortized Bayesian predictors trained under a simulator-induced prior. Empirical risk minimization then yields convergence to the Bayes-optimal predictor under the synthetic distribution. We employ classical results on distribution shift to characterize how performance degrades when the simulator diverges from reality. Beyond these consequences, we develop SGNN-specific results: (i) conditions under which unobserved scientific parameters are learnable via simulation, and (ii) a back-to-simulation attribution method that provides mechanistic explanations of predictions by linking them to the simulations the model deems similar, with guarantees of posterior consistency. We provide numerical experiments to validate theoretical predictions. SGNNs recover latent parameters, remain robust under mismatch, and outperform classical tools: in a model selection task, SGNNs achieve half the error of AIC in distinguishing mechanistic dynamics. These results establish SGNNs as a principled and practical framework for scientific prediction in data-limited regimes.
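The claim that empirical risk minimization on synthetic data recovers the Bayes-optimal predictor under the simulator-induced prior can be illustrated on a toy problem. The sketch below is not the authors' setup: the Gaussian simulator, the linear predictor class, and all function names are assumptions chosen so that the posterior mean has a closed form. With a latent parameter theta ~ N(0, 1) and an observation x = theta + noise, the posterior mean is E[theta | x] = x / (1 + noise_sd^2), so a squared-loss fit on simulated pairs should approach that slope as the number of simulations grows.

```python
import numpy as np


def simulate(n, noise_sd, rng):
    """Hypothetical toy simulator: draw a latent parameter theta from the
    prior N(0, 1), then emit a noisy observation x = theta + noise."""
    theta = rng.normal(0.0, 1.0, size=n)
    x = theta + rng.normal(0.0, noise_sd, size=n)
    return x, theta


def fit_linear_predictor(x, theta):
    """Empirical risk minimization under squared loss over f(x) = a*x + b.

    A linear class stands in for the neural network here (an assumption);
    it contains the Bayes-optimal predictor for this Gaussian model.
    """
    design = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(design, theta, rcond=None)
    return coef[0], coef[1]


rng = np.random.default_rng(0)
noise_sd = 1.0

# Train only on synthetic (observation, latent parameter) pairs.
x, theta = simulate(200_000, noise_sd, rng)
slope, intercept = fit_linear_predictor(x, theta)

# Closed-form posterior mean for this conjugate Gaussian model:
# E[theta | x] = x / (1 + noise_sd**2)
bayes_slope = 1.0 / (1.0 + noise_sd**2)

print(f"learned slope = {slope:.3f}, Bayes-optimal slope = {bayes_slope:.3f}")
print(f"learned intercept = {intercept:.3f} (Bayes-optimal intercept is 0)")
```

The fitted predictor never sees a likelihood or an explicit posterior; it is amortized in the sense that a single pass of training on simulator draws yields a map from any observation to a posterior-mean estimate. How well that map transfers to real data depends on how closely the simulator matches reality, which is the separate mismatch term the abstract refers to.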
