Let's Think Var-by-Var: Large Language Models Enable Ad Hoc Probabilistic Reasoning

Shepard Xia, Brian Lu, Jason Eisner

TL;DR

This work investigates whether large language models can supply commonsense knowledge to build ad hoc probabilistic models for novel, data-scarce questions (guesstimation). It introduces a pipeline where an LLM proposes variables and constraint relationships, which are then integrated into a log-linear model via a fuzzy maximum-entropy objective that aims to satisfy moment constraints. The approach formalizes moment constraints and exact inference on a graphical model, producing $p_\theta$ that respects LLM-derived information. Across three real-world tabular datasets (AIR, ATUS, WVS), the framework achieves performance comparable to direct LLM prompting and demonstrates robustness to noise, highlighting a principled path to augment probabilistic reasoning with learned commonsense. The results suggest the method is viable for harder questions and point to future directions in prompt design, regularization, and scaling to richer variable spaces.

Abstract

A hallmark of intelligence is the ability to flesh out underspecified situations using "common sense." We propose to extract that common sense from large language models (LLMs), in a form that can feed into probabilistic inference. We focus our investigation on $\textit{guesstimation}$ questions such as "How much are Airbnb listings in Newark, NJ?" Formulating a sensible answer without access to data requires drawing on, and integrating, bits of common knowledge about how $\texttt{Price}$ and $\texttt{Location}$ may relate to other variables, such as $\texttt{Property Type}$. Our framework answers such a question by synthesizing an $\textit{ad hoc}$ probabilistic model. First we prompt an LLM to propose a set of random variables relevant to the question, followed by moment constraints on their joint distribution. We then optimize the joint distribution $p$ within a log-linear family to maximize the overall constraint satisfaction. Our experiments show that LLMs can successfully be prompted to propose reasonable variables, and while the proposed numerical constraints can be noisy, jointly optimizing for their satisfaction reconciles them. When evaluated on probabilistic questions derived from three real-world tabular datasets, we find that our framework performs comparably to a direct prompting baseline in terms of total variation distance from the dataset distribution, and is similarly robust to noise.
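To make the "optimize $p$ within a log-linear family to maximize constraint satisfaction" step concrete, here is a minimal sketch on a toy two-variable problem. Everything in it is an assumption for illustration: the variable domains, the target numbers (which stand in for LLM-proposed constraints and are deliberately inconsistent), the one-indicator-per-joint-cell feature set, the squared-violation penalty with a small entropy bonus, and the plain gradient descent. The paper's actual fuzzy maximum-entropy objective and feature design may differ.

```python
import numpy as np

# Toy domain (hypothetical): Beds in {1, 2+}, Price in {low, mid, high}.
N_BEDS, N_PRICE = 2, 3

# Hypothetical stand-ins for LLM-proposed constraints. The Price marginal is
# deliberately inconsistent with the other two, to mimic noisy proposals.
m_beds = np.array([0.6, 0.4])                      # p(Beds)
q_price = np.array([0.45, 0.40, 0.15])             # p(Price)
c_price_given_beds = np.array([[0.5, 0.4, 0.1],    # p(Price | Beds=1)
                               [0.2, 0.5, 0.3]])   # p(Price | Beds=2+)

def build_constraints():
    """Return (F, t) so that constraint i reads E_p[f_i(x)] = t_i."""
    rows, targets = [], []
    for b in range(N_BEDS):                        # marginal on Beds
        f = np.zeros((N_BEDS, N_PRICE))
        f[b, :] = 1.0
        rows.append(f.ravel())
        targets.append(m_beds[b])
    for k in range(N_PRICE):                       # marginal on Price
        f = np.zeros((N_BEDS, N_PRICE))
        f[:, k] = 1.0
        rows.append(f.ravel())
        targets.append(q_price[k])
    for b in range(N_BEDS):                        # conditional Price | Beds
        for k in range(N_PRICE):
            # p(P=k | B=b) = c  <=>  E_p[1[B=b] * (1[P=k] - c)] = 0
            f = np.zeros((N_BEDS, N_PRICE))
            f[b, :] = -c_price_given_beds[b, k]
            f[b, k] += 1.0
            rows.append(f.ravel())
            targets.append(0.0)
    return np.array(rows), np.array(targets)

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def fit(F, t, lam=1e-3, step=1.0, iters=20000):
    """Minimize 0.5 * ||F p - t||^2 - lam * H(p) over p = softmax(theta)."""
    theta = np.zeros(F.shape[1])                   # start from uniform p
    for _ in range(iters):
        p = softmax(theta)
        g = F.T @ (F @ p - t) + lam * (np.log(p) + 1.0)  # gradient w.r.t. p
        theta -= step * p * (g - p @ g)            # chain rule through softmax
    return softmax(theta)

F, t = build_constraints()
p = fit(F, t).reshape(N_BEDS, N_PRICE)             # joint p(Beds, Price)
beds_marginal = p.sum(axis=1)
price_marginal = p.sum(axis=0)
price_given_beds = p / beds_marginal[:, None]
print("p(Beds)         =", beds_marginal.round(3))
print("p(Price)        =", price_marginal.round(3))
print("p(Price | Beds) =", price_given_beds.round(3))
```

Because the three groups of targets cannot all hold at once, the fitted joint lands on a compromise: each marginal and conditional ends up near, but not exactly at, its proposed value. Conditional queries such as $p(\texttt{Price} \mid \texttt{Beds})$ are then read off the joint by exact marginalization, which is tractable here only because the toy joint table is tiny.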

Paper Structure

This paper contains 48 sections, 3 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: An illustration of our proposed framework applied to answering an example probabilistic question, $Q =$ "How much would an Airbnb with at least two beds cost in Newark, NJ?". Going clockwise from $Q$, we first prompt an LLM to brainstorm the relevant random variables (\ref{sec:prompting} (a)), producing Price (P), Rating (R), Beds (B), and Location (L), where shaded nodes denote variables being conditioned on, blue nodes denote target variables, and white nodes denote latent variables. Then we prompt an LLM to propose interacting pairs $\{{\textnormal{v}}_1, {\textnormal{v}}_2\}$ of proposed variables, and whether to constrain $p({\textnormal{v}}_1 \mid {\textnormal{v}}_2)$ or $p({\textnormal{v}}_2 \mid {\textnormal{v}}_1)$ (\ref{sec:prompting} (b)). Next we prompt LLMs to propose numeric constraints on the marginal $p({\textnormal{v}})$ of every proposed variable, as well as the conditional marginals $p({\textnormal{v}}_1 \mid {\textnormal{v}}_2)$ of every proposed pairwise interaction (\ref{sec:prompting} (c)). Finally, we optimize the parameters of a log-linear model with the fuzzy maximum entropy objective (\ref{eq:fuzzyME}) in order to maximize constraint satisfaction (\ref{sec:prompting} (c)). The final output is an ad hoc probability model that can be used to answer $Q$. Going counter-clockwise from $Q$ is a baseline that asks a zero-shot LLM with Chain-of-Thought for an estimate of $Q$ directly.
  • Figure 2: Breakdown of the end-to-end evaluation (\ref{sec:exp2}) by number of conditions in the question.
  • Figure 3: Results of intervention experiments (\ref{sec:exp2}). "Us" in this figure refers to our approach. The top row corresponds to results on the Main set of questions on the AIR domain, and the bottom row to the Main set of questions on the ATUS domain. Columns 1 and 2 visualize the results of interventions 1 and 2, which, respectively, randomly replace zero to two latent variables with different ones after stages (a) and (b) of \ref{sec:prompting}, and randomly reverse the direction of zero to three queries from ${\textnormal{v}}_1 \mid {\textnormal{v}}_2$ to ${\textnormal{v}}_2 \mid {\textnormal{v}}_1$ after stage (b). Their $x$-axes denote the number of intervened nodes/queries, and their $y$-axes denote the average $\text{TVD}(p, \hat{p}_{\text{us}}) - \text{TVD}(p, \hat{p}_{\text{direct prompt}})$. The error bars denote one standard deviation of the average. Columns 3 and 4 correspond to intervention 3-oracle and intervention 3-noise. Their $x$-axes are the interpolation coefficient, and their $y$-axes are $\text{TVD}(p, \hat{p}_{\text{us}})$.
  • Figure 4: Scatterplot of the total variation distance against the reference, Us versus Direct Prompt, on the Main set of questions for Inside Airbnb. Each point in the plot corresponds to a question from Main on a particular evaluation split (one of Asheville, Austin, Chicago, New Orleans, Pacific Grove, and Rhode Island), averaged over three random executions at temperature 0.2. The color of a point denotes the number of conditions in the question. The other domains (ATUS and WVS) and the other set of questions (Focus) show a similar pattern in their scatterplots (not shown here).
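
The captions for Figures 3 and 4 report total variation distance (TVD) between the dataset distribution $p$ and an estimate $\hat{p}$. As a minimal sketch, assuming the standard definition for discrete distributions on a shared support, $\text{TVD}(p, \hat{p}) = \frac{1}{2} \sum_x |p(x) - \hat{p}(x)|$, the metric and the difference plotted on the $y$-axes of Figure 3 (columns 1 and 2) could be computed as follows. The three distributions are made-up numbers for illustration, not results from the paper.

```python
import numpy as np

def tvd(p, q):
    """Total variation distance between two discrete distributions
    given as probability vectors over the same ordered support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if p.shape != q.shape:
        raise ValueError("distributions must share the same support")
    return 0.5 * np.abs(p - q).sum()

# Made-up distributions over three price bins (illustration only).
p_reference = [0.5, 0.3, 0.2]   # stands in for the dataset distribution p
p_us = [0.4, 0.4, 0.2]          # stands in for the ad hoc model's estimate
p_direct = [0.3, 0.3, 0.4]      # stands in for the direct-prompt estimate

tvd_us = tvd(p_reference, p_us)
tvd_direct = tvd(p_reference, p_direct)

# Negative values mean the ad hoc model is closer to the reference
# than the direct-prompting baseline on this question.
gap = tvd_us - tvd_direct
print(f"TVD(us)={tvd_us:.2f}  TVD(direct)={tvd_direct:.2f}  gap={gap:.2f}")
```

TVD ranges from 0 (identical distributions) to 1 (disjoint support), so a gap near zero corresponds to the "performs comparably" reading of the results.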