BoardgameQA: A Dataset for Natural Language Reasoning with Contradictory Information

Mehran Kazemi; Quan Yuan; Deepti Bhatia; Najoung Kim; Xin Xu; Vaiva Imbrasaite; Deepak Ramachandran

TL;DR

This work addresses the challenge of reasoning with contradictory natural-language information by formulating it as defeasible reasoning with source preferences. It introduces BoardgameQA, a synthetic dataset built around a board-game narrative that couples defeasible theories with multi-hop questions and implicit background knowledge, including incomplete information and distractors. Through extensive evaluations of varied LM architectures and training paradigms, the authors demonstrate a substantial gap in current models' ability to resolve conflicts, with performance deteriorating as conflicts or reasoning depth increase, and even correct label predictions not guaranteeing valid proofs. The results highlight the need for new methods to endow LMs with robust conflict-resolution and reasoning capabilities applicable to real-world, conflicting information settings.

Abstract

Automated reasoning with unstructured natural text is a key requirement for many potential applications of NLP and for developing robust AI systems. Recently, Language Models (LMs) have demonstrated complex reasoning capacities even without any finetuning. However, existing evaluation for automated reasoning assumes access to a consistent and coherent set of information over which models reason. When reasoning in the real world, the available information is frequently inconsistent or contradictory, and therefore models need to be equipped with a strategy to resolve such conflicts when they arise. One widely applicable way of resolving conflicts is to impose preferences over information sources (e.g., based on source credibility or information recency) and adopt the source with higher preference. In this paper, we formulate the problem of reasoning with contradictory information guided by preferences over sources as the classical problem of defeasible reasoning, and develop a dataset called BoardgameQA for measuring the reasoning capacity of LMs in this setting. BoardgameQA also incorporates reasoning with implicit background knowledge, to better reflect reasoning problems in downstream applications. We benchmark various LMs on BoardgameQA and the results reveal a significant gap in the reasoning capacity of state-of-the-art LMs on this problem, showing that reasoning with conflicting information does not surface out-of-the-box in LMs. While performance can be improved with finetuning, it nevertheless remains poor.
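
The preference-based conflict resolution described in the abstract can be illustrated with a small sketch. This is not the authors' dataset generator or evaluation code. It is a minimal, single-hop illustration under stated assumptions: the `Rule` class, the `resolve` function, the rule names `R1` and `R2`, and the frog/dog facts are all hypothetical, and the three outcome strings are labels chosen for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A hypothetical one-hop rule: if all premises hold, conclude (not) `conclusion`."""
    name: str
    premises: frozenset  # facts that must all be present for the rule to fire
    conclusion: str      # the statement the rule is about
    positive: bool       # True: concludes the statement; False: concludes its negation


def resolve(facts, rules, preferences, question):
    """Return 'proved', 'disproved', or 'unknown' for `question`.

    `preferences` is a set of (preferred_rule, less_preferred_rule) name pairs.
    Assumption: single-hop reasoning only; conclusions are not chained.
    """
    # Rules about the question whose premises are all satisfied by the facts.
    fired = [r for r in rules if r.conclusion == question and r.premises <= facts]
    if not fired:
        return "unknown"

    # Drop every fired rule that is beaten by a conflicting, preferred fired rule.
    undefeated = [
        r for r in fired
        if not any(
            o.positive != r.positive and (o.name, r.name) in preferences
            for o in fired
        )
    ]

    verdicts = {r.positive for r in undefeated}
    if verdicts == {True}:
        return "proved"
    if verdicts == {False}:
        return "disproved"
    return "unknown"  # unresolved conflict (or no undefeated rule)


# Hypothetical example in the spirit of a board-game scenario.
rules = [
    Rule("R1", frozenset({"the frog has a card"}), "the frog attacks the dog", True),
    Rule("R2", frozenset({"the frog is injured"}), "the frog attacks the dog", False),
]
facts = {"the frog has a card", "the frog is injured"}
question = "the frog attacks the dog"

with_preference = resolve(facts, rules, {("R2", "R1")}, question)
without_preference = resolve(facts, rules, set(), question)
only_r1_fires = resolve({"the frog has a card"}, rules, {("R2", "R1")}, question)
```

In this sketch both rules fire and disagree. When `R2` is preferred over `R1`, the negative conclusion wins; with no preference the conflict stays unresolved; and when only `R1` fires there is no conflict to resolve.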

Paper Structure (19 sections, 23 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Background and Notation
The BoardgameQA Dataset
Experiments
Can LMs Reason with Contradictory Inputs?
Does Correct Label Prediction Mean Correct Proof?
Do Conflicts Make Reasoning More Difficult?
Which Conflict Type is More Difficult to Resolve?
Does Information Incompleteness Make Reasoning More Difficult?
Do Distractors Make Reasoning More Difficult?
Conclusion
More Experimental Results and Analysis
Experimental Details
BoardgameQA Details
...and 4 more sections

Figures (23)

Figure 1: A reasoning problem with contradictory information (conflict resolved based on recency).
Figure 2: An example from BoardgameQA that requires one hop of reasoning. The text in violet highlights conflict resolution and the text in blue highlights the missing information.
Figure 3: A comparison of BoardgameQA with ProofWriter (Tafjord et al., 2021) and PrOntoQA (Saparov and He, 2023) in terms of average length of examples and average number of unique tokens per example on depth 3 of the datasets.
Figure 4: The model performances on depths 1–3 of the BoardgameQA dataset. Many models struggle on this dataset, especially with higher depths.
Figure 5: Proof accuracy metrics for various models on depth 2 of the dataset, when the label is predicted correctly.
...and 18 more figures

Theorems & Definitions (1)

Example 3.1
