QUITE: Quantifying Uncertainty in Natural Language Text in Bayesian Reasoning Scenarios
Timo Pierre Schrader, Lukas Lange, Simon Razniewski, Annemarie Friedrich
TL;DR
QUITE is a linguistically varied dataset for evaluating uncertainty-aware Bayesian reasoning over natural language. Its premises, evidence statements, and queries are expressed both with numeric probabilities and with words of estimative probability, and they cover causal, evidential, and explaining-away reasoning patterns. The authors propose a neuro-symbolic pipeline that parses natural language into ProbLog representations and solves them with probabilistic logic, showing that ProbLog-FT and related neuro-symbolic methods substantially outperform vanilla LLM prompting, especially on non-forward reasoning tasks. Key contributions include the dataset creation pipeline, the first explicit separation of three Bayesian inference types in evaluation, and a thorough analysis of model strengths, weaknesses, and sources of error. The findings suggest that combining principled probabilistic reasoning with fine-tuned semantic parsing is essential for robust Bayesian reasoning in NLP, with implications for deploying reliable uncertainty-aware systems. The work also outlines future directions: improved robustness, interventional queries, and handling longer, more complex text.
Abstract
Reasoning is key to many decision-making processes. It requires combining a set of rule-like premises, which are often associated with degrees of uncertainty, with observations in order to draw conclusions. In this work, we address both the case where premises are specified as numeric probabilistic rules and situations in which humans state their estimates using words expressing degrees of certainty. Existing probabilistic reasoning datasets simplify the task, e.g., by requiring the model to only rank textual alternatives, by including only binary random variables, or by making use of a limited set of templates that result in less varied text. We present QUITE, a question answering dataset of real-world Bayesian reasoning scenarios with categorical random variables and complex relationships. QUITE provides high-quality natural language verbalizations of premises together with evidence statements and expects the answer to a question in the form of an estimated probability. We conduct an extensive set of experiments, finding that logic-based models outperform out-of-the-box large language models on all reasoning types (causal, evidential, and explaining-away). Our results provide evidence that neuro-symbolic models are a promising direction for improving complex reasoning. We release QUITE and code for training and experiments on GitHub.
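
To make the three reasoning types concrete, the following is a minimal sketch in plain Python that answers one causal, one evidential, and one explaining-away query by exhaustive enumeration. It is not taken from QUITE or its code release: the burglary/earthquake/alarm network, all probability values, and the names `joint` and `query` are hypothetical, and the variables are binary for brevity even though QUITE uses categorical variables.

```python
from itertools import product

# Hypothetical toy network (an assumption, not from QUITE):
# burglary -> alarm <- earthquake. All numbers are illustrative.
P_BURGLARY = 0.01
P_EARTHQUAKE = 0.02
P_ALARM = {  # P(alarm=True | burglary, earthquake)
    (True, True): 0.95,
    (True, False): 0.94,
    (False, True): 0.29,
    (False, False): 0.001,
}
VARIABLES = ("burglary", "earthquake", "alarm")


def joint(burglary, earthquake, alarm):
    """Joint probability of one full assignment, via the chain rule."""
    p = P_BURGLARY if burglary else 1 - P_BURGLARY
    p *= P_EARTHQUAKE if earthquake else 1 - P_EARTHQUAKE
    p_alarm = P_ALARM[(burglary, earthquake)]
    p *= p_alarm if alarm else 1 - p_alarm
    return p


def query(target, evidence):
    """P(target=True | evidence), summing over all consistent worlds."""
    numerator = 0.0
    denominator = 0.0
    for values in product([True, False], repeat=len(VARIABLES)):
        world = dict(zip(VARIABLES, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue  # world contradicts the observed evidence
        p = joint(**world)
        denominator += p
        if world[target]:
            numerator += p
    return numerator / denominator


# Causal: reason from a cause to its effect.
causal = query("alarm", {"burglary": True})
# Evidential: reason from an observed effect back to a cause.
evidential = query("burglary", {"alarm": True})
# Explaining away: a second observed cause lowers belief in the first.
explaining_away = query("burglary", {"alarm": True, "earthquake": True})

print(f"causal={causal:.4f}, evidential={evidential:.4f}, "
      f"explaining_away={explaining_away:.4f}")
```

In this toy setting, observing the earthquake in addition to the alarm makes burglary much less likely than observing the alarm alone, which is the explaining-away effect. A probabilistic logic engine such as ProbLog would compute the same quantities from a declarative program rather than from hand-written enumeration.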
