Foundation Priors
Sanjog Misra
TL;DR
Foundation priors formalize the use of foundation-model outputs as subjective prior information rather than empirical data. The approach uses an exponential tilt (via a trust parameter λ) to combine a user’s prior with synthetic-likelihood information drawn from prompts, while addressing subjectivity through integration across heterogeneous prompts and calibration with real data. The framework yields a tractable posterior and supports diverse applications from initialization and power analysis to latent-construct modeling and hierarchical settings. This provides a principled, transparent pathway for leveraging synthetic data in empirical work without conflating it with real observations.
Abstract
Foundation models, and in particular large language models, can generate highly informative responses, prompting growing interest in using these "synthetic" outputs as data in empirical research and decision-making. This paper introduces the idea of a foundation prior, which treats model-generated outputs not as real observations but as draws from the prior predictive distribution induced by the foundation prior. As such, synthetic data reflect both the model's learned patterns and the user's subjective priors, expectations, and biases. We model the subjectivity of the generative process by making explicit the dependence of synthetic outputs on the user's anticipated data distribution, the prompt-engineering process, and the trust placed in the foundation model. We derive the foundation prior as an exponentially tilted, generalized Bayesian update of the user's primitive prior, where a trust parameter governs the weight assigned to synthetic data. We then show how synthetic data and the associated foundation prior can be incorporated into standard statistical and econometric workflows, and discuss their use in applications such as refining complex models, informing latent constructs, guiding experimental design, and augmenting random-coefficient and partially linear specifications. By treating generative outputs as structured, explicitly subjective priors rather than as empirical observations, the framework offers a principled way to harness foundation models in empirical work while avoiding the conflation of synthetic "facts" with real data.
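To make the role of the trust parameter concrete, the following is a minimal sketch of one way an exponentially tilted update can look in a conjugate setting. It is an illustration under stated assumptions, not the paper's derivation: it assumes a normal primitive prior on a scalar mean, a normal synthetic likelihood with known noise variance, and a tilt that raises the synthetic likelihood to the power λ. The names `NormalPrior` and `tilt_prior` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class NormalPrior:
    """Normal distribution over a scalar mean parameter (hypothetical helper)."""
    mean: float
    var: float


def tilt_prior(prior: NormalPrior,
               synthetic: Sequence[float],
               noise_var: float,
               trust: float) -> NormalPrior:
    """Tilt a normal prior by a normal synthetic likelihood raised to `trust`.

    Assumed form (not taken from the paper):
        tilted(theta) is proportional to
        prior(theta) * prod_i N(y_i | theta, noise_var) ** trust
    trust = 0 returns the primitive prior unchanged; trust = 1 weights
    each synthetic draw like a real observation.
    """
    if trust < 0:
        raise ValueError("trust must be non-negative")
    if prior.var <= 0 or noise_var <= 0:
        raise ValueError("variances must be positive")

    n = len(synthetic)
    prior_prec = 1.0 / prior.var
    # The tilt scales the synthetic data's precision contribution by trust.
    synth_prec = trust * n / noise_var
    post_prec = prior_prec + synth_prec
    post_mean = (prior_prec * prior.mean
                 + trust * sum(synthetic) / noise_var) / post_prec
    return NormalPrior(mean=post_mean, var=1.0 / post_prec)


if __name__ == "__main__":
    primitive = NormalPrior(mean=0.0, var=1.0)
    draws = [2.0, 2.0, 2.0, 2.0]  # stand-in for model-generated outputs
    for lam in (0.0, 0.25, 1.0):
        print(lam, tilt_prior(primitive, draws, noise_var=1.0, trust=lam))
```

In this sketch, a trust of 0.25 makes four synthetic draws count as much as one real observation, which is the sense in which the parameter controls how far the synthetic data can move the primitive prior.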
