A Probabilistic Framework for LLM-Based Model Discovery

Stefan Wahl, Raphaela Schenk, Ali Farnoud, Jakob H. Macke, Daniel Gedon

TL;DR

This work recasts model discovery as probabilistic inference, i.e., as sampling from an unknown distribution over mechanistic models capable of explaining the data. It introduces ModelSMC, an algorithm based on Sequential Monte Carlo sampling that represents candidate models as particles, which are iteratively proposed and refined by an LLM and weighted using likelihood-based criteria.

Abstract

Automated methods for discovering mechanistic simulator models from observational data offer a promising path toward accelerating scientific progress. Such methods often take the form of agentic-style iterative workflows that repeatedly propose and revise candidate models by imitating human discovery processes. However, existing LLM-based approaches typically implement such workflows via hand-crafted heuristic procedures, without an explicit probabilistic formulation. We recast model discovery as probabilistic inference, i.e., as sampling from an unknown distribution over mechanistic models capable of explaining the data. This perspective provides a unified way to reason about model proposal, refinement, and selection within a single inference framework. As a concrete instantiation of this view, we introduce ModelSMC, an algorithm based on Sequential Monte Carlo sampling. ModelSMC represents candidate models as particles which are iteratively proposed and refined by an LLM, and weighted using likelihood-based criteria. Experiments on real-world scientific systems illustrate that this formulation discovers models with interpretable mechanisms and improves posterior predictive checks. More broadly, this perspective provides a probabilistic lens for understanding and developing LLM-based approaches to model discovery.
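
The resample, propagate, weight cycle that ModelSMC borrows from SMC can be sketched on a toy problem. The following is a minimal sketch, not the authors' implementation. It assumes a finite model space indexed by integers, a symmetric random-walk proposal standing in for the LLM, and a tempered sequence of targets with a Metropolis-Hastings accept step (a standard resample-move SMC sampler). The function name `tempered_smc` and all numbers are hypothetical.

```python
import math
import random


def tempered_smc(log_prior, log_lik, n_particles=4000, n_steps=10, seed=0):
    """Resample-move SMC over a finite model space {0, ..., K-1}.

    Intermediate targets are pi_t(m) ∝ prior(m) * lik(m) ** beta_t, with
    beta_t rising linearly from 0 to 1, so the final target is the posterior.
    """
    rng = random.Random(seed)
    K = len(log_prior)
    # Initial particles are drawn from the (possibly unnormalized) prior.
    prior_w = [math.exp(lp) for lp in log_prior]
    particles = rng.choices(range(K), weights=prior_w, k=n_particles)

    for t in range(1, n_steps + 1):
        beta_prev, beta = (t - 1) / n_steps, t / n_steps

        # Weighting: incremental importance weight lik ** (beta - beta_prev),
        # shifted by the maximum for numerical stability.
        incs = [(beta - beta_prev) * log_lik[m] for m in particles]
        top = max(incs)
        weights = [math.exp(v - top) for v in incs]

        # Resampling: multinomial, proportional to the weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)

        # Propagation: random-walk proposal on a ring (stand-in for the LLM),
        # accepted with a Metropolis-Hastings step that leaves pi_t invariant.
        moved = []
        for m in particles:
            cand = (m + rng.choice((-1, 1))) % K
            log_ratio = (log_prior[cand] + beta * log_lik[cand]) - (
                log_prior[m] + beta * log_lik[m]
            )
            accept = rng.random() < math.exp(min(0.0, log_ratio))
            moved.append(cand if accept else m)
        particles = moved
    return particles


# Toy usage: six candidate models, uniform prior, hypothetical log-likelihoods.
toy_log_lik = [-8.0, -3.0, -1.0, 0.0, -2.0, -6.0]
toy_particles = tempered_smc([0.0] * 6, toy_log_lik)
toy_freq = [toy_particles.count(m) / len(toy_particles) for m in range(6)]
```

With enough particles, `toy_freq` should approximate the normalized values of `exp(toy_log_lik)`. In ModelSMC the propagation step is an LLM call and the weights follow the paper's own likelihood-based criteria, neither of which is reproduced here.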

Paper Structure

This paper contains 55 sections, 2 theorems, 47 equations, 5 figures, 2 tables, 1 algorithm.

Key Result

Theorem 3.1

Let $\pi(m) \propto p(x_o | m)\,p(m)$ be a fixed target distribution over models, and let $q(m' | m)$ denote an idealized proposal kernel induced by the propagation step. Assume (i) support coverage, i.e., any $m$ with $\pi(m)>0$ can be generated with non-zero probability; (ii) uniformly bounded impo[...] with asymptotic variance $\mathcal{O}(1/N)$. The result holds in the presence of resampling.
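
The $\mathcal{O}(1/N)$ scaling can be checked numerically with plain self-normalized importance sampling, a simpler relative of SMC without resampling or propagation. This is only an illustrative sketch under stated assumptions: a five-element model space with hypothetical unnormalized target values, an independent uniform proposal in place of the kernel $q(m' | m)$, and a reading of assumption (ii) as a bound on the importance weights, which is a guess because the statement above is truncated. The function names are hypothetical.

```python
import random


def snis_estimate(target_unnorm, proposal, f, n, rng):
    """Self-normalized importance sampling estimate of E_pi[f(m)]."""
    draws = rng.choices(range(len(proposal)), weights=proposal, k=n)
    # Importance weights pi~(m) / q(m); bounded because q has full support.
    weights = [target_unnorm[m] / proposal[m] for m in draws]
    return sum(w * f(m) for w, m in zip(weights, draws)) / sum(weights)


def mean_squared_error(n, repeats=400, seed=0):
    """Average squared error of the estimate of pi(m = 2) over many repeats."""
    rng = random.Random(seed)
    target_unnorm = [1.0, 4.0, 10.0, 4.0, 1.0]      # pi(m) up to a constant
    proposal = [0.2] * 5                             # uniform, full support
    exact = target_unnorm[2] / sum(target_unnorm)    # exact value of pi(m = 2)

    def indicator(m):
        return 1.0 if m == 2 else 0.0

    errors = [
        (snis_estimate(target_unnorm, proposal, indicator, n, rng) - exact) ** 2
        for _ in range(repeats)
    ]
    return sum(errors) / repeats


# Increasing N by a factor of 16 should shrink the error by roughly that factor.
mse_small = mean_squared_error(100)
mse_large = mean_squared_error(1600)
```

If the proposal assigned zero probability to a model with positive target mass, violating (i), the estimate would stay biased no matter how large `n` becomes.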

Figures (5)

  • Figure 1: Overview of ModelSMC for automated LLM-based model discovery. (a) Given a textual problem formulation and context data, we infer a simulator model implemented in code. (b) ModelSMC iteratively refines an initial model to sample high-density regions of the model posterior $p(m | {\bm{x}}_o)$, approaching the unknown data-generating process (red star) in high-density regions. (c) ModelSMC is inspired by SMC, which approximates evolving distributions via weighted particles by iterative resampling, propagation, and weighting. (d) In ModelSMC, models are propagated by LLM sampling and weighted by likelihood evaluation.
  • Figure 2: Relative sampling frequency of the target model for LLM-free ModelSMC. Sampling frequencies are averaged over ten different target models and ten experiments each. Shaded region: range between the highest and the lowest sampling frequency observed at each time step over all runs.
  • Figure 3: Systems pharmacology kidney model with experimental data. (a) Original code snippet in R for the aldosterone mechanism (left) and one instance of the inferred model code (right). (b) Posterior predictive for the code instance in (a) with real-world data points.
  • Figure 4: Hodgkin-Huxley model with Allen Cell Types Database data. Results are shown for one representative random seed from the 10 runs, used consistently across panels. (a) ModelSMC convergence across runs, showing the mean (solid line) and 95% percentiles (shaded). With lower opacity, we overlay the full model ancestry of the selected run. (b) For the selected run, highlighting of inferred ion-channel mechanisms at successive improvement stages. (c) Posterior predictive voltage simulations for real observations from the Allen database, comparing the baseline HH model to the best model identified in the selected run.
  • Figure G-1: Performance of the discovered models as a function of token usage. Each row corresponds to a task, and each column to a performance metric. Dark blue: ModelSMC. Light blue: ModelSMC with $N=1$. Green: FunSearch+. The depicted performance metrics and token counts are averaged over the five best of the ten discovery runs conducted for the experiments discussed in \ref{sec:quantitative_results}. First row: Synthetic data from a SIR epidemiological model (\ref{app:ExperimentalDetails_SIR}). Second row: Hodgkin-Huxley model (\ref{sec:exp-allen} and \ref{app:ExperimentalDetails_allen}). Third row: Pharmacological kidney model (\ref{sec:exp-kidney} and \ref{app:ExperimentalDetails_kidney}). Left column: Negative average log-likelihood of the observed data marginalized over the model parameters $\theta$, i.e., the resampling weight for ModelSMC (\ref{eq:ResamplingWeightModelSMCFactorization}). Right column: Negative average log-likelihood of the observed data given the parameter estimate $\hat{\theta}$ (\ref{eq:appendix_FunSearch_likelihood_score}).
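
The two metrics named in the Figure G-1 caption, a likelihood marginalized over $\theta$ and a likelihood evaluated at a point estimate $\hat{\theta}$, can be contrasted on a toy model. This is a hedged sketch, not the paper's scoring code. It assumes a unit-variance Gaussian with unknown mean $\theta$, a standard normal prior on $\theta$, a plain Monte Carlo average over prior samples, the sample mean as $\hat{\theta}$, and "average" meaning division by the number of observations. All function names and data values are hypothetical.

```python
import math
import random


def log_lik(xs, theta):
    """Log-likelihood of xs under independent N(theta, 1) observations."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - theta) ** 2 for x in xs)


def marginal_nll(xs, n_samples=20000, seed=0):
    """Negative average log marginal likelihood, theta ~ N(0, 1), via Monte Carlo."""
    rng = random.Random(seed)
    logs = [log_lik(xs, rng.gauss(0.0, 1.0)) for _ in range(n_samples)]
    top = max(logs)  # log-sum-exp shift for numerical stability
    log_marginal = top + math.log(sum(math.exp(v - top) for v in logs) / n_samples)
    return -log_marginal / len(xs)


def plugin_nll(xs):
    """Negative average log-likelihood at the maximum-likelihood estimate."""
    theta_hat = sum(xs) / len(xs)
    return -log_lik(xs, theta_hat) / len(xs)


def exact_marginal_nll(xs):
    """Closed form for this conjugate model: xs ~ N(0, I + 1 1^T)."""
    n, s, ss = len(xs), sum(xs), sum(x * x for x in xs)
    log_marginal = (
        -0.5 * n * math.log(2 * math.pi)
        - 0.5 * math.log(1 + n)
        - 0.5 * (ss - s * s / (1 + n))
    )
    return -log_marginal / n


observations = [0.3, 1.1, 0.8, 1.6, 0.2]
scores = {
    "marginal (Monte Carlo)": marginal_nll(observations),
    "marginal (exact)": exact_marginal_nll(observations),
    "plug-in": plugin_nll(observations),
}
```

The plug-in score is never worse than the marginal one here, because the likelihood at its maximizer bounds any average of likelihoods. The marginal score additionally penalizes models whose good fit depends on a narrow range of parameter values.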

Theorems & Definitions (3)

  • Theorem 3.1: Consistency of ModelSMC
  • Theorem C.1: Restatement of Theorem 3.1
  • Proof