Simulation as Supervision: Mechanistic Pretraining for Scientific Discovery

Carson Dudley; Reiden Magdaleno; Christopher Harding; Marisa Eisenberg

TL;DR

Simulation-Grounded Neural Networks (SGNNs) address the tension between the interpretability of mechanistic models and the predictive power of machine learning by training neural networks on synthetic data generated from diverse mechanistic simulators, which creates a structural prior over feasible system dynamics. The framework is architecture-agnostic and supports supervised learning for unobservable quantities, cross-task generalization, and mechanistic interpretability via back-to-simulation attribution. Across epidemiology, ecology, chemistry, and diffusion, SGNNs achieve robust forecasting, accurate parameter inference (e.g., early $R_0$ for COVID-19), and interpretable representations grounded in simulated regimes. The results indicate that mechanistic simulations can serve as effective training data, enabling robust scientific inference that generalizes beyond fixed functional forms.

Abstract

Scientific modeling faces a tradeoff between the interpretability of mechanistic theory and the predictive power of machine learning. While hybrid approaches like Physics-Informed Neural Networks (PINNs) embed domain knowledge as functional constraints, they can be brittle under model misspecification. We introduce Simulation-Grounded Neural Networks (SGNNs), a framework that instead embeds domain knowledge into the training data to establish a structural prior. By pretraining on synthetic corpora spanning diverse model structures and observational artifacts, SGNNs learn the broad patterns of physical possibility. This allows the model to internalize the underlying dynamics of a system without being forced to satisfy a single, potentially incorrect equation. We evaluated SGNNs across scientific disciplines and found that this approach confers significant robustness. In prediction tasks, SGNNs nearly tripled COVID-19 forecasting skill versus CDC baselines. In tests on dengue outbreaks, SGNNs outperformed physics-constrained models even when both were restricted to incorrect human-to-human transmission equations (dengue is mosquito-borne), suggesting that SGNNs are more robust to model misspecification. For inference, SGNNs extend the logic of simulation-based inference to enable supervised learning for unobservable targets, estimating early COVID-19 transmissibility more accurately than traditional methods. Finally, SGNNs enable back-to-simulation attribution, a form of mechanistic interpretability that maps real-world data back to the simulated manifold to identify underlying processes. By unifying these disparate simulation-based techniques into a single framework, we demonstrate that mechanistic simulations can serve as effective training data for robust scientific inference that generalizes beyond the limitations of fixed functional forms.
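
To make the idea of simulation as supervision concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single discrete-time SIR simulator with under-reporting and Poisson noise, whereas the abstract describes corpora spanning diverse model structures. All function names (simulate_sir, observe, build_corpus, attribute) and all parameter ranges are hypothetical. The sketch shows how an unobservable quantity such as $R_0$ becomes a supervised label because the simulator knows it, and how a crude nearest-neighbor lookup can stand in for back-to-simulation attribution.

```python
import numpy as np


def simulate_sir(beta, gamma, n_days=60, population=100_000, i0=10):
    """Daily-step SIR model; returns the number of new infections per day."""
    s, i = float(population - i0), float(i0)
    incidence = np.empty(n_days)
    for t in range(n_days):
        new_inf = min(beta * s * i / population, s)  # cannot infect more than S
        new_rec = gamma * i
        s -= new_inf
        i += new_inf - new_rec
        incidence[t] = new_inf
    return incidence


def observe(incidence, report_rate, rng):
    """Observation artifacts (assumed): under-reporting plus Poisson noise."""
    return rng.poisson(report_rate * incidence)


def build_corpus(n_sims, rng):
    """Sample simulator parameters, simulate, and keep R0 as the label."""
    curves, r0_labels = [], []
    for _ in range(n_sims):
        gamma = rng.uniform(0.1, 0.5)        # hypothetical recovery-rate range
        r0 = rng.uniform(1.1, 4.0)           # hypothetical R0 range
        beta = r0 * gamma                    # for SIR, R0 = beta / gamma
        report_rate = rng.uniform(0.2, 1.0)  # hypothetical reporting fraction
        curves.append(observe(simulate_sir(beta, gamma), report_rate, rng))
        r0_labels.append(r0)
    return np.asarray(curves, dtype=float), np.asarray(r0_labels)


def attribute(query, curves, k=5):
    """Indices of the k simulated curves closest to the query curve.

    Uses Euclidean distance on log1p-scaled counts as a stand-in for a
    learned embedding space (an assumption of this sketch).
    """
    dist = np.linalg.norm(np.log1p(curves) - np.log1p(query), axis=1)
    return np.argsort(dist)[:k]


rng = np.random.default_rng(0)
curves, r0_labels = build_corpus(500, rng)

# Any regressor can now be fit on (curves, r0_labels); the labels come
# from the simulator, never from real-world measurement.
neighbors = attribute(curves[0], curves[1:], k=5)
print("R0 of nearest simulated regimes:", r0_labels[1:][neighbors])
```

In this sketch the pair (curves, r0_labels) plays the role of the synthetic pretraining corpus, and the parameters of the retrieved neighbors give a rough indication of which simulated regime an observed curve resembles.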
