Learning From Simulators: A Theory of Simulation-Grounded Learning

Carson Dudley, Marisa Eisenberg

TL;DR

This work develops a formal theory of Simulation-Grounded Neural Networks (SGNNs), which learn predictive mappings entirely from synthetic data produced by mechanistic simulators. It treats SGNNs as amortized Bayesian predictors under a simulator-induced prior and proves convergence to the Bayes-optimal predictor under the synthetic distribution, with a generalization bound that separates learning performance from simulator–reality mismatch. The authors introduce back-to-simulation attribution for mechanistic interpretability and prove consistency of the attribution with the posterior over latent simulator parameters. They further show that SGNNs can learn unobservable scientific quantities under identifiability and validate the theory with controlled experiments that demonstrate learning of latent parameters, robustness to misspecification, and superior model selection between mechanistic structures. Collectively, the results provide a principled, practical foundation for prediction, interpretation, and discovery in data-limited scientific domains.

Abstract

Simulation-Grounded Neural Networks (SGNNs) are predictive models trained entirely on synthetic data from mechanistic simulations. They have achieved state-of-the-art performance in domains where real-world labels are limited or unobserved, but lack a formal underpinning. We place SGNNs in a unified statistical framework. Under standard loss functions, they can be interpreted as amortized Bayesian predictors trained under a simulator-induced prior. Empirical risk minimization then yields convergence to the Bayes-optimal predictor under the synthetic distribution. We employ classical results on distribution shift to characterize how performance degrades when the simulator diverges from reality. Beyond these consequences, we develop SGNN-specific results: (i) conditions under which unobserved scientific parameters are learnable via simulation, and (ii) a back-to-simulation attribution method that provides mechanistic explanations of predictions by linking them to the simulations the model deems similar, with guarantees of posterior consistency. We provide numerical experiments to validate theoretical predictions. SGNNs recover latent parameters, remain robust under mismatch, and outperform classical tools: in a model selection task, SGNNs achieve half the error of AIC in distinguishing mechanistic dynamics. These results establish SGNNs as a principled and practical framework for scientific prediction in data-limited regimes.
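
Two standard facts sit behind the abstract's claims, and a short worked statement may help. The notation here is an assumption for illustration, not the paper's: $P_{\mathrm{syn}}$ and $P_{\mathrm{real}}$ denote the synthetic and real joint distributions of $(x, y)$, $R_{\mathrm{syn}}$ and $R_{\mathrm{real}}$ the corresponding risks, and $y = T(\theta)$ the target. Under squared loss, the minimizer of the synthetic risk is the posterior mean under the simulator-induced prior. For any loss taking values in $[0, B]$, the gap between real and synthetic risk is at most $B$ times the total variation distance.

```latex
% Bayes-optimal predictor under squared loss (posterior mean under the simulator prior)
f^*(x) \;=\; \arg\min_{f}\; \mathbb{E}_{(x,y)\sim P_{\mathrm{syn}}}\!\big[(f(x)-y)^2\big]
       \;=\; \mathbb{E}[\,y \mid x\,]
       \;=\; \int T(\theta)\, p(\theta \mid x)\, d\theta .

% Classical distribution-shift inequality for a loss bounded in [0, B]
\big| R_{\mathrm{real}}(f) - R_{\mathrm{syn}}(f) \big|
  \;\le\; B \cdot \mathrm{TV}\big(P_{\mathrm{real}}, P_{\mathrm{syn}}\big),
\qquad
\mathrm{TV}(P, Q) \;=\; \sup_{A}\, \big| P(A) - Q(A) \big| .
```

The second inequality is what lets a bound of this kind separate learning performance (how small $R_{\mathrm{syn}}$ can be made) from simulator-reality mismatch (the total variation term).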

Paper Structure

This paper contains 76 sections, 5 theorems, 47 equations, 5 figures.

Key Result

Theorem 1

Let the loss $\ell:\mathbb{R}\times\mathcal{Y}\to[0,B]$ be convex in its first argument (a condition satisfied by standard choices such as mean squared error, cross-entropy, and quantile loss commonly used in SGNN training), $L$-Lipschitz in that argument, and bounded by $B>0$. With probability at least … where $\widehat{\mathfrak{R}}_N(\mathcal{F})$ is the empirical Rademacher complexity of the function …
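
For orientation, a standard textbook uniform-deviation bound under these same assumptions is given below. It is not necessarily the exact statement or constants of Theorem 1, which is an excess-risk bound. Here $R(f)$ is the risk under the synthetic distribution, $\widehat{R}_N(f)$ the empirical risk on $N$ simulated samples, and $\delta \in (0,1)$ a confidence level; these symbols are assumed notation.

```latex
% With probability at least 1 - \delta over the N simulated samples,
% simultaneously for all f in the function class \mathcal{F}:
R(f) \;\le\; \widehat{R}_N(f)
      \;+\; 2L\,\widehat{\mathfrak{R}}_N(\mathcal{F})
      \;+\; 3B\sqrt{\frac{\log(2/\delta)}{2N}} .
```

The factor $L$ comes from the contraction lemma for Lipschitz losses, and the factor $B$ from rescaling a loss in $[0,B]$ to $[0,1]$.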

Figures (5)

  • Figure 1: Simulation-grounded learning schematic. Parameters $\theta$ sampled from a prior $P(\theta)$ are fed into a mechanistic model $\mathcal{M}$ (e.g., differential equations) to generate latent system dynamics $w$. An observation model $\mathcal{O}$ then transforms these into realistic observed data $x$ by adding noise, delays, and other artifacts. Together, $\mathcal{M}$ and $\mathcal{O}$ form the complete simulator $\mathcal{S} = \mathcal{O} \circ \mathcal{M}$. The SGNN $f_\phi$ learns to map from observations $x$ to target quantities $y = T(\theta)$, where $T$ can be simulator parameters, future trajectories, etc.
  • Figure 2: SGNNs approximate the Bayes-optimal predictor. Left: MSE between SGNN and Monte Carlo estimate of $f^*(x)$. Right: MSE between SGNN and ground-truth $\theta$. SGNNs outperform the baseline (dashed green line) due to amortized inference and smooth generalization.
  • Figure 3: Empirical validation of the SGNN generalization bound. We plot the test loss of a predictor trained on synthetic data ($A_0$) and evaluated under increasing misspecification ($A^* = A_0 + \delta U$). Left: Worst-case theoretical bound derived from total variation distance. Right: Empirical, data-dependent bound. In both cases, the actual test loss remains below the predicted generalization error, validating the theoretical guarantee in Theorem 2.
  • Figure 4: Attribution consistency under KL alignment training. KL divergence between SGNN attribution distribution and true posterior decreases over training epochs.
  • Figure 5: SGNNs outperform AIC for structural model selection. Classification error over training epochs for distinguishing SIR vs SEIR models from noisy trajectory data. SGNNs rapidly converge to low error rates while AIC maintains significantly higher error, demonstrating the advantage of simulation-grounded learning for unobservable structural targets.
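
The pipeline in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not the authors' implementation: the mechanistic model (exponential decay), the uniform prior, the Gaussian observation noise, the identity target $T(\theta) = \theta$, and every function name are hypothetical. A linear least-squares fit also stands in for the neural network $f_\phi$ to keep the example dependency-free.

```python
import numpy as np

def mechanistic_model(theta, t):
    """Hypothetical M: exponential decay w(t) = exp(-theta * t), one row per theta."""
    return np.exp(-np.outer(theta, t))

def observation_model(w, rng, noise_sd=0.05):
    """Hypothetical O: additive Gaussian measurement noise."""
    return w + rng.normal(0.0, noise_sd, size=w.shape)

def simulate(n, rng, t):
    """Draw n synthetic (x, y) pairs from the simulator S = O . M."""
    theta = rng.uniform(0.5, 2.0, size=n)  # theta ~ P(theta); uniform prior is an assumption
    w = mechanistic_model(theta, t)        # latent dynamics
    x = observation_model(w, rng)          # noisy observations
    y = theta                              # target y = T(theta), identity here
    return x, y

def fit_linear(x, y):
    """Least-squares stand-in for f_phi (a real SGNN would be a neural network)."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict(coef, x):
    return np.column_stack([np.ones(len(x)), x]) @ coef

rng = np.random.default_rng(0)
t = np.linspace(0.0, 3.0, 20)
x_train, y_train = simulate(5000, rng, t)  # training uses synthetic data only
x_test, y_test = simulate(1000, rng, t)

coef = fit_linear(x_train, y_train)
mse_model = float(np.mean((predict(coef, x_test) - y_test) ** 2))
mse_prior_mean = float(np.mean((y_train.mean() - y_test) ** 2))
print(f"fitted predictor MSE: {mse_model:.4f}, prior-mean baseline MSE: {mse_prior_mean:.4f}")
```

The prior-mean baseline ignores the observations entirely, so comparing against it gives a rough check that the predictor has extracted information about $\theta$ from $x$. Evaluating on data from a perturbed simulator instead of `simulate` would mimic the misspecification setting of Figure 3.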

Theorems & Definitions (15)

  • Definition 1: Mechanistic Simulator
  • Definition 2: Simulation-Grounded Neural Network (SGNN)
  • Definition 3: Model Misspecification
  • Theorem 1: Finite-sample excess-risk bound
  • Proposition 1: Consistency of the SGNN Estimator
  • Remark 1: How SGNNs Achieve Bayesian Optimality
  • Definition 4: Total Variation Mismatch
  • Theorem 2: Generalization Bound for Model Misspecification
  • Theorem 3: Back-to-Simulation Attribution Consistency
  • Definition 5: Unobservable target
  • ...and 5 more