BaRDa: A Belief and Reasoning Dataset that Separates Factual Accuracy and Reasoning Ability

Peter Clark, Bhavana Dalvi Mishra, Oyvind Tafjord

TL;DR

BaRDa is a Belief and Reasoning Dataset that separates factual accuracy from reasoning ability by using four entailment types (TT, TF, FT, FF) and counterfactual premises. It combines data from EntailmentBank, Entailer, and GPT-3-generated content into 3000 gold-labeled entailments over 9000 statements, scored with three metrics: Belief Accuracy, Reasoning Accuracy, and Reasoning Consistency (based on the metric $\tau$, with consistency = $1-\tau$). The paper benchmarks four GPT-family models: factual accuracy rises steadily across them, and reasoning accuracy improves overall, with a dip for GPT3.5. The results also highlight model-specific biases in consistency. The paper discusses limitations such as scope, label noise, and prompting effects, and positions BaRDa as a community resource for evaluating belief and reasoning in current and future models.

Abstract

While there are numerous benchmarks comparing the performance of modern language models (LMs), end-task evaluations often conflate notions of *factual accuracy* ("truth") and *reasoning ability* ("rationality", or "honesty" in the sense of correctly reporting implications of beliefs). Our goal is a dataset that clearly distinguishes these two notions. Our approach is to leverage and extend a collection of human-annotated *entailment trees*, engineered to express both good and bad chains of reasoning, and using a mixture of true and false facts, in particular including counterfactual examples, to avoid belief bias (also known as the "content effect"). The resulting dataset, called BaRDa, contains 3000 entailments (1787 valid, 1213 invalid), using 6681 true and 2319 false statements. Testing on four GPT-series models, GPT3(curie)/GPT3(davinci)/3.5/4, we find factual accuracy (truth) scores of 74.1/80.6/82.6/87.1 and reasoning accuracy scores of 63.1/78.0/71.8/79.2. This shows the clear progression of models towards improved factual accuracy and entailment reasoning, and the dataset provides a new benchmark that more cleanly separates and quantifies these two notions.
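
To make the separation of the two notions concrete, the following is a minimal sketch of how the two accuracy scores could be computed independently. It is not the authors' evaluation code. It assumes each score is a plain percentage of model answers that match the gold labels. The class names, field names, and toy examples are hypothetical, and the consistency metric $\tau$ is not reproduced here because its definition is not given in this section.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Statement:
    """A single statement with a gold truth label and a model answer (hypothetical schema)."""
    text: str
    gold_true: bool
    pred_true: bool


@dataclass
class Entailment:
    """A premises -> hypothesis step with a gold validity label and a model answer (hypothetical schema)."""
    premises: List[str]
    hypothesis: str
    gold_valid: bool
    pred_valid: bool


def percent_agreement(pairs: Iterable[Tuple[bool, bool]]) -> float:
    """Percentage of (gold, predicted) pairs that agree."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one example")
    return 100.0 * sum(gold == pred for gold, pred in pairs) / len(pairs)


def belief_accuracy(statements: List[Statement]) -> float:
    # Assumption: scored over statements only, ignoring any entailment judgments.
    return percent_agreement((s.gold_true, s.pred_true) for s in statements)


def reasoning_accuracy(entailments: List[Entailment]) -> float:
    # Assumption: scored over entailment validity only, regardless of whether
    # the premises are actually true (this is what counterfactuals test).
    return percent_agreement((e.gold_valid, e.pred_valid) for e in entailments)


# Toy data: the first entailment is counterfactual (a false premise, but a valid step).
statements = [
    Statement("All birds can swim.", gold_true=False, pred_true=False),
    Statement("A sparrow is a bird.", gold_true=True, pred_true=True),
    Statement("A sparrow can swim.", gold_true=False, pred_true=False),
    Statement("Metals conduct electricity.", gold_true=True, pred_true=False),
]
entailments = [
    # A model showing belief bias rejects this valid step because the conclusion is false.
    Entailment(["All birds can swim.", "A sparrow is a bird."],
               "A sparrow can swim.", gold_valid=True, pred_valid=False),
    Entailment(["Metals conduct electricity."],
               "A sparrow is a bird.", gold_valid=False, pred_valid=False),
]

print(f"belief accuracy:    {belief_accuracy(statements):.1f}")
print(f"reasoning accuracy: {reasoning_accuracy(entailments):.1f}")
```

In this toy run the model gets every fact in the counterfactual example right but still misjudges the valid entailment, so the two scores diverge. That divergence is the kind of effect the dataset is designed to expose.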

Paper Structure (22 sections, 2 equations, 7 tables)