Belief in the Machine: Investigating Epistemological Blind Spots of Language Models

Mirac Suzgun, Tayfun Gur, Federico Bianchi, Daniel E. Ho, Thomas Icard, Dan Jurafsky, James Zou

TL;DR

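This study systematically evaluates the epistemic reasoning capabilities of modern LMs, including GPT-4, Claude-3, and Llama-3, using a new dataset, KaBLE, consisting of 13,000 questions across 13 tasks. It finds that the models struggle to acknowledge beliefs that contradict facts, particularly first-person beliefs, and lack a robust grasp of the factive nature of knowledge.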

Abstract

As language models (LMs) become integral to fields like healthcare, law, and journalism, their ability to differentiate between fact, belief, and knowledge is essential for reliable decision-making. Failure to grasp these distinctions can lead to significant consequences in areas such as medical diagnosis, legal judgments, and the dissemination of fake news. Despite this, current literature has largely focused on more complex issues such as theory of mind, overlooking more fundamental epistemic challenges. This study systematically evaluates the epistemic reasoning capabilities of modern LMs, including GPT-4, Claude-3, and Llama-3, using a new dataset, KaBLE, consisting of 13,000 questions across 13 tasks. Our results reveal key limitations. First, while LMs achieve 86% accuracy on factual scenarios, their performance drops significantly with false scenarios, particularly in belief-related tasks. Second, LMs struggle with recognizing and affirming personal beliefs, especially when those beliefs contradict factual data, which raises concerns for applications in healthcare and counseling, where engaging with a person's beliefs is critical. Third, we identify a salient bias in how LMs process first-person versus third-person beliefs: they perform better on third-person tasks (80.7%) than on first-person tasks (54.4%). Fourth, LMs lack a robust understanding of the factive nature of knowledge, namely, that knowledge inherently requires truth. Fifth, LMs rely on linguistic cues for fact-checking and sometimes bypass deeper reasoning. These findings highlight significant concerns about current LMs' ability to reason about truth, belief, and knowledge, and they emphasize the need for advancements in these areas before broad deployment in critical sectors.

Paper Structure

This paper contains 16 sections, 4 equations, 26 figures, 4 tables.

Figures (26)

  • Figure 1: Modern language models have a systematic difficulty in verifying (left) and confirming (right) personal beliefs, especially when those beliefs challenge facts or their training data.
  • Figure 2: GPT-4o and other models tend to have difficulty in affirming first-person beliefs involving new facts.
  • Figure 3: Language models such as GPT-4 fail to consistently affirm and acknowledge personal beliefs, especially when those are expressed in the first-person and not consistent with the factual knowledge learned during training. Despite the user clearly stating their belief, the model occasionally provides incorrect or uncertain responses.
  • Figure 4: Sample true (factual) and false statements from the KaBLE dataset. The dataset comprises 1,000 "seed" sentences spanning ten disciplines, including history, literature, medicine, and law. Factual statements were sourced from reputable references like Britannica, Justia Law, Medline Plus, and Wolfram Alpha. Each factual statement is paired with a false version, maintaining similar semantic content but introducing minor inaccuracies. These sentence pairs form the basis for generating questions across thirteen epistemological tasks detailed in the paper's experimental setup section.
  • Figure 5: Left: The prompt template used for the input queries for language models. Right: An input example for the confirmation of personal belief task. (The US does not have an official language, but the answer is (A).)
  • ...and 21 more figures