Belief Revision: The Adaptability of Large Language Models Reasoning

Bryan Wilie, Samuel Cahyawijaya, Etsuko Ishii, Junxian He, Pascale Fung

TL;DR

LMs generally struggle to appropriately revise their beliefs in response to new information, and models adept at updating often underperform in scenarios where no update is needed, highlighting a critical trade-off.

Abstract

The capability to reason from text is crucial for real-world NLP applications. Real-world scenarios often involve incomplete or evolving data. In response, individuals update their beliefs and understandings accordingly. However, most existing evaluations assume that language models (LMs) operate with consistent information. We introduce Belief-R, a new dataset designed to test LMs' belief revision ability when presented with new evidence. Inspired by how humans suppress prior inferences, this task assesses LMs within the newly proposed delta reasoning ($\Delta R$) framework. Belief-R features sequences of premises designed to simulate scenarios where additional information could necessitate revising prior conclusions drawn by LMs. We evaluate $\sim$30 LMs across diverse prompting strategies and find that LMs generally struggle to appropriately revise their beliefs in response to new information. Further, models adept at updating often underperform in scenarios where no update is needed, highlighting a critical trade-off. These insights underscore the importance of improving LMs' adaptiveness to changing information, a step toward more reliable AI systems.

Paper Structure (37 sections, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Belief revision allows reasoners to update their beliefs based on newly provided evidence. Such an ability is necessary to enable better logical reasoning in the case of defeasible inference.
  • Figure 2: Human reasoning adapts based on new information, leading us to adjust our prior beliefs. Here, the additional condition (left) casts doubt on the prior modus ponens conclusion in (a). People may consider that certain other conditions are necessary for this conclusion to hold, i.e., the library must remain open. In contrast, the alternative argument (right) does not affect the modus ponens inference pathway, so the prior conclusion can still hold.
  • Figure 3: Evaluation of basic logical inference capabilities on Belief-R for various LLMs, sorted by parameter count. Pre-trained LLMs with $\geq$6B parameters achieve adequate accuracy ($\geq$75%), while instruction-tuned LLMs achieve the same performance at a much smaller scale, with $\geq$2.7B parameters.
  • Figure 4: BREU score evaluation of belief revision capabilities on Belief-R for various models, sorted by BREU score. While larger-scale LLMs tend to achieve higher BREU scores, their performance is far below their basic logical inference accuracy at $t$ (Acc@t), showcasing the limited capability of LLMs in performing belief revision.
  • Figure 5: Performance comparisons broken down by inference type (modus ponens vs. modus tollens), by effect entity, and by prompting method.
  • ...and 4 more figures
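
The contrast described in the Figure 2 caption can be sketched in a few lines of Python. This is only an illustrative toy, not the Belief-R data format or the authors' evaluation code: the `Conditional` class, the `role` labels, both function names, and the example sentences (other than the library remaining open, which the caption mentions) are hypothetical. It assumes the role of the new premise is already known, which is exactly the judgment an LM has to make on its own.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Conditional:
    """A premise of the form 'if antecedent, then consequent' (hypothetical structure)."""
    antecedent: str
    consequent: str
    # "base": the original rule
    # "additional": a further requirement for the same consequent
    # "alternative": another sufficient route to the same consequent
    role: str = "base"

def conclusion_at_t(rule, fact):
    """Plain modus ponens on the first two premises."""
    return rule.consequent if fact == rule.antecedent else None

def conclusion_at_t_plus_1(rule, fact, new_rule, known_facts=frozenset()):
    """Re-evaluate the prior conclusion after a third premise arrives."""
    prior = conclusion_at_t(rule, fact)
    if prior is None:
        return None
    unmet_requirement = (
        new_rule.role == "additional"
        and new_rule.consequent == rule.consequent
        and new_rule.antecedent not in known_facts
    )
    # An unmet additional condition suppresses the inference (None = "cannot conclude").
    return None if unmet_requirement else prior

# Hypothetical example sentences in the spirit of Figure 2.
study = "she studies late in the library"
rule = Conditional("she has an essay to write", study)
additional = Conditional("the library remains open", study, role="additional")
alternative = Conditional("she has a textbook to read", study, role="alternative")
fact = "she has an essay to write"

print(conclusion_at_t(rule, fact))
print(conclusion_at_t_plus_1(rule, fact, additional))
print(conclusion_at_t_plus_1(rule, fact, alternative))
```

In this toy, the additional condition withdraws the earlier conclusion unless it is known to hold, while the alternative leaves it intact, mirroring the two cases in the caption.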