Simulation as Supervision: Mechanistic Pretraining for Scientific Discovery

Carson Dudley, Reiden Magdaleno, Christopher Harding, Marisa Eisenberg

TL;DR

Simulation-Grounded Neural Networks (SGNNs) address the tension between mechanistic interpretability and predictive power by training neural networks on synthetic data generated from diverse mechanistic simulators, creating a structural prior over feasible system dynamics. The framework is architecture-agnostic and supports supervised learning for unobservable quantities, cross-task generalization, and mechanistic interpretability via back-to-simulation attribution. Across epidemiology, ecology, chemistry, and information diffusion, SGNNs achieve robust forecasting, accurate parameter inference (e.g., early $R_0$ for COVID-19), and interpretable representations grounded in simulated regimes. This unified approach demonstrates that mechanistic simulations can serve as effective training data, enabling robust scientific inference that generalizes beyond fixed functional forms.

Abstract

Scientific modeling faces a tradeoff between the interpretability of mechanistic theory and the predictive power of machine learning. While hybrid approaches like Physics-Informed Neural Networks (PINNs) embed domain knowledge as functional constraints, they can be brittle under model misspecification. We introduce Simulation-Grounded Neural Networks (SGNNs), a framework that instead embeds domain knowledge into the training data to establish a structural prior. By pretraining on synthetic corpora spanning diverse model structures and observational artifacts, SGNNs learn the broad patterns of physical possibility. This allows the model to internalize the underlying dynamics of a system without being forced to satisfy a single, potentially incorrect equation. We evaluated SGNNs across scientific disciplines and found that this approach confers significant robustness. In prediction tasks, SGNNs nearly tripled COVID-19 forecasting skill versus CDC baselines. In tests on dengue outbreaks, SGNNs outperformed physics-constrained models even when both were restricted to incorrect human-to-human transmission equations, suggesting that SGNNs are more robust to model misspecification. For inference, SGNNs extend the logic of simulation-based inference to enable supervised learning for unobservable targets, estimating early COVID-19 transmissibility more accurately than traditional methods. Finally, SGNNs enable back-to-simulation attribution, a form of mechanistic interpretability that maps real-world data back to the simulated manifold to identify underlying processes. By unifying these disparate simulation-based techniques into a single framework, we demonstrate that mechanistic simulations can serve as effective training data for robust scientific inference that generalizes beyond the limitations of fixed functional forms.
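
The recipe the abstract describes (simulate mechanistic dynamics, layer observational artifacts on top, and keep the generating parameters as supervised labels) can be illustrated with a toy synthetic corpus. This is a minimal sketch, not the authors' pipeline. The function names `simulate_seir`, `observe`, and `build_corpus`, the discrete-time SEIR structure, the parameter ranges, the Poisson reporting noise, and the 5% missingness rate are all assumptions chosen for illustration. The paper's corpora span diverse model structures, whereas this sketch uses a single one for brevity.

```python
import numpy as np

def simulate_seir(beta, sigma, gamma, n_pop, n_days, i0=10):
    """Discrete-time deterministic SEIR model; returns daily new infectious cases."""
    s, e, i = n_pop - i0, 0.0, float(i0)
    incidence = np.empty(n_days)
    for t in range(n_days):
        # Clip so that a single step never exposes more people than are susceptible.
        new_exposed = min(s, beta * s * i / n_pop)
        new_infectious = sigma * e
        new_recovered = gamma * i
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_recovered
        incidence[t] = new_infectious
    return incidence

def observe(incidence, report_rate, missing_prob, rng):
    """Apply assumed observational artifacts: underreporting, noise, missingness."""
    counts = rng.poisson(report_rate * incidence).astype(float)
    counts[rng.random(counts.shape) < missing_prob] = np.nan
    return counts

def build_corpus(n_series, n_days, seed=0):
    """Return (observed series, R0 labels); parameter ranges are illustrative only."""
    rng = np.random.default_rng(seed)
    series = np.empty((n_series, n_days))
    r0 = np.empty(n_series)
    for k in range(n_series):
        gamma = rng.uniform(0.1, 0.3)   # recovery rate per day
        sigma = rng.uniform(0.2, 0.5)   # incubation rate per day
        r0[k] = rng.uniform(1.2, 6.0)   # unobservable target, known in simulation
        beta = r0[k] * gamma            # for this SEIR model, R0 = beta / gamma
        true_incidence = simulate_seir(beta, sigma, gamma, n_pop=1e6, n_days=n_days)
        series[k] = observe(true_incidence, rng.uniform(0.1, 0.9), 0.05, rng)
    return series, r0

series, r0 = build_corpus(n_series=8, n_days=60)
```

Each row of `series` is a noisy, underreported, partially missing outbreak curve, and `r0` holds the matching ground-truth label. A network trained on such pairs receives direct supervision for a quantity that is never observed in real surveillance data, which is the sense in which simulation acts as supervision.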

Paper Structure

This paper contains 57 sections, 13 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Simulation-Grounded Neural Networks (SGNNs) use mechanistic simulations as synthetic supervision for scientific modeling. Top: SGNNs are trained on synthetic datasets generated by mechanistic simulators spanning diverse model structures and parameter regimes. Observational artifacts (such as noise, bias, and missingness) are explicitly simulated to mimic real-world data. Neural networks are trained on this synthetic corpus to learn latent mechanistic patterns, enabling generalization to real-world data without retraining. Bottom: SGNNs unify the strengths of mechanistic and data-driven models. They encode scientific theory during training, generalize out-of-the-box without tuning, enable supervised learning for unobservable scientific quantities, support cross-task modeling across domains, and provide process-level interpretability through back-to-simulation attribution.
  • Figure 2: SGNNs outperform state-of-the-art baselines in real-world disease forecasting and reveal the importance of mechanistic grounding. (A) SGNNs achieve 35.3% forecasting skill on early COVID-19 mortality, almost tripling the CDC Forecast Hub median and exceeding its best model, despite using no real COVID-19 data. (B) SGNNs produce accurate forecasts with calibrated uncertainty across diverse real-world locations, including Michigan, New York, Texas, and California. (C) Across all U.S. states, SGNNs consistently achieve lower error (lighter colors) than Chronos and PINNs, showing robust generalization from synthetic pretraining. (D) On dengue, a domain with fundamentally different transmission dynamics, SGNNs outperform both PINNs and statistical models in a fully zero-shot setting, demonstrating robustness to mechanistic misspecification. (E) Replacing mechanistic simulations with neural simulators causes performance collapse, highlighting the necessity of mechanistic fidelity. (F) Restricting pretraining to simple SEIR models leads to overconfident exponential forecasts, confirming that mechanistic diversity and surveillance realism are essential for robust downstream performance.
  • Figure 3: SGNNs generalize across scientific domains and task types. (A) Ecological forecasting: SGNNs outperform task-specific neural networks on both low-dimensional predator-prey systems (hare and lynx) and high-dimensional multispecies forecasting from the UK Butterfly Monitoring Scheme. SGNNs maintain forecasting skill as the number of species increases, while baselines degrade sharply. (B) Chemical yield prediction: SGNNs reduce residual variance by one-third compared to state-of-the-art models trained directly on the Suzuki-Miyaura reaction dataset. Right: predicted vs. actual yield for SGNN (blue) and baseline (orange) models, showing tighter clustering around the identity line for SGNN. (C) Diffusion source identification: SGNNs accurately infer the source of information spread in partially observed cascades. Left: top-$k$ accuracy comparison shows SGNNs outperform the Rumor Center method. Right: predicted source probabilities for a representative cascade, with node color and size indicating model-assigned likelihood. SGNN correctly assigns high probability to the true source despite missing observations.
  • Figure 4: SGNNs accurately infer unobservable parameters and provide mechanistic interpretability via back-to-simulation attribution. (A) SGNNs infer a high reproduction number ($R_0 = 6.14$) for New York City from early COVID-19 case data (Feb–Mar 2020), aligning with estimates from more complete datasets. Traditional methods underestimated early transmission due to underreporting and simplifying assumptions. (B) SGNNs achieve significantly lower mean squared error (left) and mean percentage error (right) in $R_0$ estimation compared to maximum likelihood estimation (MLE) and exponential growth-based methods. (C) Back-to-simulation attribution retrieves the 50 most similar synthetic outbreaks to a real-world input and visualizes the distribution of their underlying mechanistic parameters. For Michigan COVID-19 mortality, the retrieved simulations align closely with true estimates for hospitalization rate, population size, asymptomatic transmission rate, and hospitalization fatality rate (dashed lines), confirming that the SGNN’s internal representations encode meaningful mechanistic structure.
  • Figure 5: SGNNs outperform classical models in forecasting lynx-hare predator-prey dynamics. Rolling forecasts are shown for four evaluation windows, comparing SGNN (red) with mechanistic (RMG, blue) and VARMA (green) models. True population trajectories are plotted in black. SGNNs consistently maintain accurate phase and amplitude tracking, while baselines deteriorate.
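
Back-to-simulation attribution, as described for Figure 4(C), retrieves the 50 synthetic outbreaks most similar to a real-world input and inspects the distribution of their mechanistic parameters. The sketch below illustrates that retrieval step under assumptions the captions do not specify: similarity is taken to be cosine similarity between embedding vectors, the embeddings would come from some trained encoder (replaced here by random placeholder vectors), and the function name `back_to_simulation`, the 32-dimensional embeddings, and the parameter values are hypothetical.

```python
import numpy as np

def back_to_simulation(query_emb, sim_embs, sim_params, k=50):
    """Return indices and parameters of the k simulations nearest to the query.

    Similarity is cosine similarity in embedding space (an assumption here).
    """
    q = query_emb / np.linalg.norm(query_emb)
    s = sim_embs / np.linalg.norm(sim_embs, axis=1, keepdims=True)
    similarity = s @ q
    top = np.argsort(-similarity)[:k]  # most similar first
    return top, {name: values[top] for name, values in sim_params.items()}

# Placeholder data standing in for encoder outputs and simulator parameters.
rng = np.random.default_rng(0)
sim_embs = rng.normal(size=(1000, 32))
sim_params = {"hospitalization_rate": rng.uniform(0.01, 0.2, size=1000)}

# A "real-world" query built as a slightly perturbed copy of simulation 123.
query_emb = sim_embs[123] + 0.01 * rng.normal(size=32)

idx, retrieved = back_to_simulation(query_emb, sim_embs, sim_params, k=50)

# Summarize each retrieved parameter distribution by its 5th, 50th, 95th percentiles.
summary = {name: np.percentile(v, [5, 50, 95]) for name, v in retrieved.items()}
```

Comparing each entry in `summary` against an independent estimate of the same parameter mirrors the dashed-line comparison in Figure 4(C). Tight agreement would suggest that the embedding space organizes outbreaks by mechanism and not only by curve shape.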