Epidemiology of Large Language Models: A Benchmark for Observational Distribution Knowledge
Drago Plecko, Patrik Okanovic, Shreyas Havaldar, Torsten Hoefler, Elias Bareinboim
TL;DR
The paper introduces an observational-distribution benchmark for large language models grounded in Pearl's Causal Hierarchy, aiming to assess whether LLMs internalize real-world population distributions. It uses 10 large US-population datasets to generate 169 tasks evaluated via QA and likelihood prompting with a bootstrap-based distributional scoring scheme. Across open- and closed-weight models, results show only modest alignment with ground-truth distributions and no robust gains from instruction tuning or fine-tuning, challenging the notion of universal distributional learning. The findings have implications for the use of LLMs in causal inference and highlight the need for improved methods to instill reliable probabilistic knowledge in language models.
Abstract
Artificial intelligence (AI) systems hold great promise for advancing various scientific disciplines, and are increasingly used in real-world applications. Despite their remarkable progress, further capabilities are needed to achieve more general types of intelligence. A critical distinction in this context is between factual knowledge, which can be evaluated against true or false answers (e.g., "what is the capital of England?"), and probabilistic knowledge, which reflects probabilistic properties of the real world (e.g., "what is the sex of a computer science graduate in the US?"). In this paper, our goal is to build a benchmark for understanding the capabilities of LLMs in terms of knowledge of probability distributions describing the real world. Given that LLMs are trained on vast amounts of text, it is plausible that they internalize aspects of these distributions. Indeed, LLMs are touted as powerful universal approximators of real-world distributions. At the same time, classical results in statistics, known as the curse of dimensionality, highlight fundamental difficulties in learning distributions in high dimensions, calling into question the notion of universal distributional learning. In this work, we develop the first benchmark to directly test this hypothesis, evaluating whether LLMs have access to empirical distributions describing real-world populations across domains such as economics, health, education, and social behavior. Our results demonstrate that LLMs perform poorly overall, and do not seem to internalize real-world statistics naturally. When interpreted in the context of Pearl's Causal Hierarchy (PCH), our benchmark demonstrates that language models do not contain knowledge of observational distributions (Layer 1 of PCH), and thus the Causal Hierarchy Theorem implies that interventional (Layer 2) and counterfactual (Layer 3) knowledge of these models is also limited.
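To make the idea of scoring probabilistic knowledge concrete, the following minimal Python sketch compares a model's stated probability for a binary attribute against an empirical proportion, using a percentile bootstrap to account for sampling noise in the data. It is an illustration only and not the scoring scheme used in the paper: the function names, the 95% level, and the toy sample are all assumptions, and the data are synthetic rather than drawn from any of the benchmark datasets.

```python
import random


def bootstrap_proportions(sample, n_boot=1000, seed=0):
    """Resample `sample` (a list of 0/1 values) with replacement and
    return the proportion of 1s in each bootstrap resample."""
    rng = random.Random(seed)
    n = len(sample)
    return [sum(rng.choices(sample, k=n)) / n for _ in range(n_boot)]


def percentile_interval(values, level=0.95):
    """Percentile interval of `values` at the given level."""
    ordered = sorted(values)
    lo_idx = int((1 - level) / 2 * (len(ordered) - 1))
    hi_idx = int((1 + level) / 2 * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]


def score_answer(model_prob, sample, n_boot=1000, seed=0):
    """Compare a model's stated probability with the empirical proportion.

    Hypothetical scoring rule (an assumption, not the paper's): the answer
    counts as consistent with the data if it falls inside the bootstrap
    percentile interval of the empirical proportion."""
    estimate = sum(sample) / len(sample)
    lo, hi = percentile_interval(bootstrap_proportions(sample, n_boot, seed))
    return {
        "estimate": estimate,
        "interval": (lo, hi),
        "abs_error": abs(model_prob - estimate),
        "within_interval": lo <= model_prob <= hi,
    }


# Synthetic toy sample: 300 ones and 700 zeros (not real survey data).
toy_sample = [1] * 300 + [0] * 700

# A hypothetical model answer of 0.5 is far from the empirical proportion.
far_answer = score_answer(0.5, toy_sample)

# A hypothetical model answer of 0.3 matches the empirical proportion.
close_answer = score_answer(0.3, toy_sample)
```

The bootstrap interval here plays the role of a tolerance: an answer is penalized only when its error exceeds what resampling the data alone would produce. Under this assumed rule, `far_answer` falls outside the interval and `close_answer` falls inside it.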
