What Are the Odds? Language Models Are Capable of Probabilistic Reasoning

Akshay Paruchuri, Jake Garrison, Shun Liao, John Hernandez, Jacob Sunshine, Tim Althoff, Xin Liu, Daniel McDuff

TL;DR

This paper performs a systematic evaluation of state-of-the-art LMs on three tasks: estimating percentiles, drawing samples, and calculating probabilities, using idealized and real-world statistical distributions.

Abstract

Language models (LMs) are capable of remarkably complex linguistic tasks; however, numerical reasoning is an area in which they frequently struggle. An important but rarely evaluated form of reasoning is understanding probability distributions. In this paper, we focus on evaluating the probabilistic reasoning capabilities of LMs using idealized and real-world statistical distributions. We perform a systematic evaluation of state-of-the-art LMs on three tasks: estimating percentiles, drawing samples, and calculating probabilities. We evaluate three ways to provide context to LMs: (1) anchoring examples from within a distribution or family of distributions, (2) real-world context, and (3) summary statistics on which to base a Normal approximation. Models can make inferences about distributions, and can be further aided by the incorporation of real-world context, example shots, and simplified assumptions, even if these assumptions are incorrect or misspecified. To conduct this work, we developed a comprehensive benchmark distribution dataset with associated question-answer pairs that we have released publicly.
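
As a rough illustration of the three tasks under a Normal approximation, the sketch below computes reference answers from summary statistics alone, using Python's standard library. It is not the authors' benchmark code: the function names, the mean of 70, and the standard deviation of 10 are all hypothetical values chosen for the example.

```python
from statistics import NormalDist


def normal_approximation(mean: float, std: float) -> NormalDist:
    """Build a Normal approximation from two summary statistics."""
    return NormalDist(mu=mean, sigma=std)


def percentile_of(dist: NormalDist, value: float) -> float:
    """Task 1: percentile (0-100) at which `value` falls."""
    return 100.0 * dist.cdf(value)


def draw_samples(dist: NormalDist, n: int, seed: int = 0) -> list[float]:
    """Task 2: draw `n` samples, seeded so the result is reproducible."""
    return dist.samples(n, seed=seed)


def probability_between(dist: NormalDist, low: float, high: float) -> float:
    """Task 3: probability that a value lies in [low, high]."""
    return dist.cdf(high) - dist.cdf(low)


# Hypothetical summary statistics, not taken from the paper's dataset.
approx = normal_approximation(mean=70.0, std=10.0)
print(percentile_of(approx, 70.0))              # the mean sits at the 50th percentile
print(probability_between(approx, 60.0, 80.0))  # mass within one standard deviation
print(draw_samples(approx, 3))
```

If the underlying data are skewed or heavy-tailed, these reference values will be biased, which is the sense in which such an assumption can be misspecified yet still give a model a useful anchor.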

Paper Structure (24 sections, 9 figures, 3 tables)

Figures (9)

  • Figure 1: LMs & Probabilistic Reasoning. Models can make inferences about distributions, but can be aided by the incorporation of real-world context, example shots and simplified assumptions, even if these assumptions are incorrect or misspecified.
  • Figure 2: Distributions. A visualization of the 12 idealized and 12 real-world distributions across the domains of health, finance, and climate involved in our evaluation.
  • Figure 3: Results on Idealized Distributions. Model results (top) estimating percentiles, (middle) drawing samples, (bottom) estimating probabilities, for five common distributions (see the appendix for results on all distributions).
  • Figure 4: Language models appear to interpolate between in-context examples. Comparison of within-family and within-distribution shot types to a baseline in which the answer is based on the nearest corresponding shot to the target percentile value (nearest neighbor). Importantly, the baseline does not perform any interpolation between percentiles.
  • Figure 5: Inferences can be aided by context and simplified assumptions. Mean absolute error in calculating percentiles for real-world distributions with different prompts, including idealized distributions without real-world context, added real-world context, and a Normal approximation approach that simplifies parameter content. (*) designates $p < 0.05$ for all possible pairs using the Wilcoxon signed-rank test.
  • ...and 4 more figures
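
The nearest-neighbor baseline described for Figure 4 can be contrasted with simple interpolation in a short sketch. This is an illustration under assumptions, not the paper's implementation: it assumes each shot is a (value, percentile) pair and that the query is a value whose percentile is wanted. The function names and example shots are hypothetical.

```python
from bisect import bisect_left

Shot = tuple[float, float]  # (value, percentile), an assumed representation


def nearest_neighbor_percentile(shots: list[Shot], target_value: float) -> float:
    """Baseline: copy the percentile of the closest shot, with no interpolation."""
    return min(shots, key=lambda shot: abs(shot[0] - target_value))[1]


def interpolated_percentile(shots: list[Shot], target_value: float) -> float:
    """Linearly interpolate between the two shots that bracket the target."""
    ordered = sorted(shots)
    values = [value for value, _ in ordered]
    # Clamp to the end shots when the target lies outside the observed range.
    if target_value <= values[0]:
        return ordered[0][1]
    if target_value >= values[-1]:
        return ordered[-1][1]
    i = bisect_left(values, target_value)
    (v0, p0), (v1, p1) = ordered[i - 1], ordered[i]
    return p0 + (p1 - p0) * (target_value - v0) / (v1 - v0)


# Hypothetical in-context examples, not drawn from the benchmark.
example_shots = [(10.0, 20.0), (20.0, 40.0), (30.0, 60.0)]
print(nearest_neighbor_percentile(example_shots, 14.0))  # 20.0
print(interpolated_percentile(example_shots, 14.0))      # 28.0
```

A model whose answers track the interpolated values more closely than the nearest-neighbor values is doing more than copying the closest example, which is the comparison the figure draws.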