AutoStan: Autonomous Bayesian Model Improvement via Predictive Feedback

Oliver Dürr

Abstract

We present AutoStan, a framework in which a command-line interface (CLI) coding agent autonomously builds and iteratively improves Bayesian models written in Stan. The agent operates in a loop, writing a Stan model file, executing MCMC sampling, then deciding whether to keep or revert each change based on two complementary feedback signals: the negative log predictive density (NLPD) on held-out data and the sampler's own diagnostics (divergences, R-hat, effective sample size). We evaluate AutoStan on five datasets with diverse modeling structures. On a synthetic regression dataset with outliers, the agent progresses from naive linear regression to a model with Student-t robustness, nonlinear heteroscedastic structure, and an explicit contamination mixture, matching or outperforming TabPFN, a state-of-the-art black-box method, while remaining fully interpretable. Across four additional experiments, the same mechanism discovers hierarchical partial pooling, varying-slope models with correlated random effects, and a Poisson attack/defense model for soccer. No search algorithm, critic module, or domain-specific instructions are needed. This is, to our knowledge, the first demonstration that a CLI coding agent can autonomously write and iteratively improve Stan code for diverse Bayesian modeling problems.
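The keep-or-revert mechanism described in the abstract can be illustrated with a minimal Python sketch. Everything here is a hypothetical stand-in: `fit_and_score` would, in the real system, compile the Stan file, run MCMC, check sampler diagnostics (divergences, R-hat, effective sample size), and compute held-out NLPD; `propose_edit` represents the coding agent rewriting the model file. The stub below only demonstrates the greedy accept/revert logic, not the actual AutoStan implementation.

```python
import random


def fit_and_score(model_code: str) -> float:
    """Hypothetical stand-in for: compile the Stan model, run MCMC,
    verify diagnostics, and return held-out NLPD (lower is better).
    Deterministic placeholder so the loop below is reproducible."""
    rng = random.Random(len(model_code))
    return rng.uniform(1.0, 2.5)


def autostan_loop(initial_model: str, propose_edit, n_iters: int = 5):
    """Greedy loop: keep an edit only if it improves held-out NLPD.

    The real system additionally rejects candidates whose sampler
    diagnostics fail, regardless of NLPD; that check is omitted here.
    """
    best_model = initial_model
    best_nlpd = fit_and_score(best_model)
    history = [best_nlpd]
    for _ in range(n_iters):
        candidate = propose_edit(best_model)   # agent edits the Stan file
        nlpd = fit_and_score(candidate)        # sample, score on held-out data
        if nlpd < best_nlpd:                   # improvement: keep the change
            best_model, best_nlpd = candidate, nlpd
        # otherwise the change is reverted (best_model is unchanged)
        history.append(best_nlpd)
    return best_model, best_nlpd, history
```

By construction the NLPD trajectory is non-increasing, which is why a single scalar feedback signal suffices to drive model improvement without a separate search algorithm or critic module.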



Figures (4)

  • Figure 1: AutoStan on the large 1D regression dataset ($n{=}500$ train, 200 test; DGP defined in Section \ref{sec:setup}). In panels (b)--(d), the fitted mean $\hat{f}(x)$ is plotted with $\pm2\hat{\sigma}(x)$ shaded; the faint grey overlay marks the oracle ground truth $f(x)\pm2\sigma(x)$. Training points outside the plot range $[-5,7]$ are shown as edge-pinned triangles ($\blacktriangle$/$\blacktriangledown$) at the panel boundary. (a) NLPD trajectory over 15 iterations; colored markers identify the three model-fit panels. The dashed oracle line ($\mathrm{NLPD}=1.14$) is a lower bound that cannot be reached (see Section \ref{sec:setup}). (b) Baseline (iter 0): Gaussian linear model; predictive bands are dominated by the ${\approx}30$ extreme training outliers ($\mathrm{NLPD}=2.16$). (c) Iter 1: cubic polynomial mean $+$ Student-$t$ likelihood; one step eliminates most band inflation ($\mathrm{NLPD}=1.32$). (d) Iter 11 (best): sinusoidal mean with learned frequency $\omega$, heteroscedastic log-linear variance, and a contamination-mixture likelihood; tight, locally calibrated bands closely follow the oracle noise envelope ($\mathrm{NLPD}=1.23$). (e) TabPFN (90% predictive interval, $\mathrm{NLPD}=1.25$): mean tracking is accurate but intervals are uniformly too wide---the model has absorbed the training outliers, inflating uncertainty across the full input range.
  • Figure 2: AutoStan on the small 1D regression dataset ($n{=}68$). (a) NLPD trajectory over 9 iterations. (b) Baseline: linear mean + Gaussian likelihood; enormous predictive bands driven by 4 extreme outliers. (c) Iter 1: quadratic mean + Student-$t$ likelihood, the largest single gain ($\Delta{=}0.75$). (d) Iter 5 (best): cubic mean + log-linear $\sigma(x)$ + contamination mixture; NLPD 1.1244, matching TabPFN (1.1202) while remaining fully interpretable.
  • Figure 3: Bundesliga model parameters (iteration 9). (a) Dortmund (0.39), Freiburg (0.33), Bayern (0.32) benefit most from playing at home; Heidenheim ($-$0.01) and St. Pauli (0.00) show no home advantage. (b) Attack vs. defense with 90% credible intervals; Bayern's dominance is clearly visible.
  • Figure 4: TabPFN predictions. On the large dataset, the intervals widen globally near outliers instead of isolating them, explaining TabPFN's higher NLPD (1.2501 vs. AutoStan's 1.2256).