QUITE: Quantifying Uncertainty in Natural Language Text in Bayesian Reasoning Scenarios

Timo Pierre Schrader, Lukas Lange, Simon Razniewski, Annemarie Friedrich

TL;DR

QUITE introduces a linguistically varied dataset for evaluating uncertainty-aware Bayesian reasoning in natural language. It covers premises stated numerically or with words of estimative probability, along with evidence statements and queries, across causal, evidential, and explaining-away reasoning patterns. The authors propose a neuro-symbolic pipeline that parses natural language into ProbLog representations and solves them with probabilistic logic, showing that ProbLog-FT and related neuro-symbolic methods substantially outperform vanilla LLM prompting, especially on non-forward reasoning tasks. Key contributions include the dataset creation pipeline, the first explicit separation of three Bayesian inference types in evaluation, and a thorough analysis of model strengths, weaknesses, and sources of error. The findings suggest that combining principled probabilistic reasoning with fine-tuned semantic parsing is essential for robust, real-world Bayesian reasoning in NLP, with implications for deploying reliable uncertainty-aware systems. The work also outlines future directions: robustness, interventional queries, and handling longer, more complex text in uncertainty reasoning contexts.

Abstract

Reasoning is key to many decision-making processes. It requires consolidating a set of rule-like premises, which are often associated with degrees of uncertainty, with observations in order to draw conclusions. In this work, we address both the case where premises are specified as numeric probabilistic rules and situations in which humans state their estimates using words expressing degrees of certainty. Existing probabilistic reasoning datasets simplify the task, e.g., by requiring the model to only rank textual alternatives, by including only binary random variables, or by making use of a limited set of templates that result in less varied text. In this work, we present QUITE, a question answering dataset of real-world Bayesian reasoning scenarios with categorical random variables and complex relationships. QUITE provides high-quality natural language verbalizations of premises together with evidence statements and expects the answer to a question in the form of an estimated probability. We conduct an extensive set of experiments, finding that logic-based models outperform out-of-the-box large language models on all reasoning types (causal, evidential, and explaining-away). Our results provide evidence that neuro-symbolic models are a promising direction for improving complex reasoning. We release QUITE and code for training and experiments on GitHub.
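
To make the three reasoning types named above concrete, the following minimal sketch computes one causal, one evidential, and one explaining-away query by exact enumeration. It is only an illustration: the network (two independent binary causes `A` and `B` with a common effect `E`), all probabilities, and the helper names `joint` and `query` are hypothetical. None of it is taken from QUITE, and it uses plain Python rather than the ProbLog pipeline described in the paper.

```python
from itertools import product

# Hypothetical toy network (NOT from the QUITE dataset):
# two independent binary causes A and B with a common effect E.
P_A = 0.1  # assumed prior P(A = True)
P_B = 0.2  # assumed prior P(B = True)
# Assumed conditional table P(E = True | A, B)
P_E_GIVEN = {
    (True, True): 0.95,
    (True, False): 0.8,
    (False, True): 0.7,
    (False, False): 0.05,
}


def joint(a, b, e):
    """Joint probability P(A=a, B=b, E=e) via the chain rule."""
    pa = P_A if a else 1 - P_A
    pb = P_B if b else 1 - P_B
    pe = P_E_GIVEN[(a, b)] if e else 1 - P_E_GIVEN[(a, b)]
    return pa * pb * pe


def query(target, evidence):
    """P(target | evidence) by summing over all possible worlds.

    Both arguments are dicts mapping variable names to booleans.
    """
    num = 0.0
    den = 0.0
    for a, b, e in product([True, False], repeat=3):
        world = {"A": a, "B": b, "E": e}
        # Skip worlds that contradict the observed evidence.
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(a, b, e)
        den += p
        if all(world[k] == v for k, v in target.items()):
            num += p
    return num / den


# Causal: reason from a cause to its effect.
causal = query({"E": True}, {"A": True})
# Evidential: reason from an observed effect back to a cause.
evidential = query({"A": True}, {"E": True})
# Explaining away: observing the other cause B lowers the belief in A.
explaining_away = query({"A": True}, {"E": True, "B": True})

print(f"causal          P(E | A)    = {causal:.3f}")
print(f"evidential      P(A | E)    = {evidential:.3f}")
print(f"explaining away P(A | E, B) = {explaining_away:.3f}")
```

With these assumed numbers, the belief in `A` given `E` drops once `B` is also observed, which is the explaining-away effect. Enumeration is used here only because the network is tiny.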

Paper Structure

This paper contains 41 sections, 12 equations, 13 figures, and 6 tables.

Figures (13)

  • Figure 1: Percentage of instances solved correctly for each Bayesian reasoning type. The neuro-symbolic Mistral-FT+ProbLog approach is robust against the inherent difficulties of different reasoning types.
  • Figure 2: Example instances of QUITE. Each question is categorized according to the reasoning pattern.
  • Figure 3: Exemplary network from QUITE about the relationship between gallstones, flatulence and amylase levels.
  • Figure 4: Full ProbLog code for the gallstone-flatulence-amylase instance.
  • Figure 5: Example network for demonstrating our procedure of subsetting.
  • ...and 8 more figures