Conditional and Modal Reasoning in Large Language Models

Wesley H. Holliday, Matthew Mandelkern, Cedegao E. Zhang

TL;DR

This paper probes the extent to which twenty-nine LLMs can distinguish logically correct inferences from logically fallacious ones, highlighting gaps in basic logical reasoning in today’s LLMs.

Abstract

The reasoning abilities of large language models (LLMs) are the topic of a growing body of research in AI and cognitive science. In this paper, we probe the extent to which twenty-nine LLMs are able to distinguish logically correct inferences from logically fallacious ones. We focus on inference patterns involving conditionals (e.g., 'If Ann has a queen, then Bob has a jack') and epistemic modals (e.g., 'Ann might have an ace', 'Bob must have a king'). These inferences have been of special interest to logicians, philosophers, and linguists, since they play a central role in the fundamental human ability to reason about distal possibilities. Assessing LLMs on these inferences is thus highly relevant to the question of how much the reasoning abilities of LLMs match those of humans. All the LLMs we tested make some basic mistakes with conditionals or modals, though zero-shot chain-of-thought prompting helps them make fewer mistakes. Even the best performing LLMs make basic errors in modal reasoning, display logically inconsistent judgments across inference patterns involving epistemic modals and conditionals, and give answers about complex conditional inferences that do not match reported human judgments. These results highlight gaps in basic logical reasoning in today's LLMs.
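To make the contrast between a logically correct and a logically fallacious inference concrete, here is a minimal sketch of a brute-force validity check over truth assignments. It assumes the conditional 'If Ann has a queen, then Bob has a jack' is read as a classical material conditional, which is a simplifying assumption for illustration only and not necessarily the analysis adopted in the paper. The helper names `implies` and `is_valid` are hypothetical.

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """Material conditional (an assumption made for this sketch)."""
    return (not a) or b


def is_valid(premises, conclusion, n_vars: int = 2) -> bool:
    """Return True if every truth assignment that satisfies all premises
    also satisfies the conclusion."""
    for values in product([True, False], repeat=n_vars):
        if all(p(*values) for p in premises) and not conclusion(*values):
            return False  # counterexample found
    return True


# q: "Ann has a queen", j: "Bob has a jack"
conditional = lambda q, j: implies(q, j)

# Modus ponens: from "if q then j" and q, infer j.
modus_ponens = is_valid([conditional, lambda q, j: q], lambda q, j: j)

# Affirming the consequent: from "if q then j" and j, infer q.
affirming_consequent = is_valid([conditional, lambda q, j: j], lambda q, j: q)

print("modus ponens valid:", modus_ponens)
print("affirming the consequent valid:", affirming_consequent)
```

Under this reading, modus ponens has no counterexample, while affirming the consequent fails on the assignment where Ann lacks a queen but Bob has a jack. Inferences involving epistemic modals such as 'might' and 'must' are not captured by a plain truth table and would need a richer semantics.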

Paper Structure (45 sections, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Summary of performance on the uncontroversial logical inference patterns discussed in § \ref{Infs}. Guessing accuracy is 50%. Larger models generally perform better, and most models show clear weakness at this task.
  • Figure 2: Zero-shot responses for MTmu (above) and MTmi (below) show inconsistency for many models. All error bars, including in subsequent figures, represent 95% confidence intervals.
  • Figure 3: Zero-shot responses for DSmu (above) and DSmi (below) show inconsistency for many models.
  • Figure 4: Percentage of responses that were jointly consistent when we asked leading models about DSmu, MiN, and DSmi in the same context window, in one of the six possible orders. Each dot represents such an order. The results show strong sensitivity to question order, which is highly undesirable.
  • Figure 5: Responses for CMP, zero-shot (above) and chain-of-thought (below); LLMs were asked whether the inference preserved likelihood, i.e., if $q\to r$ must be likely when $p\to(q\to r)$ is certain and $p$ is likely.
  • ...and 3 more figures
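The caption of Figure 4 mentions six possible orders in which the three questions (DSmu, MiN, DSmi) can be asked in one context window. As a minimal sketch, the orders can be enumerated as follows; the variable names are hypothetical and this is not taken from the authors' experimental code.

```python
from itertools import permutations

# The three inference patterns named in the Figure 4 caption.
questions = ("DSmu", "MiN", "DSmi")

# Three items can be arranged in 3! = 6 distinct orders.
orders = list(permutations(questions))

for order in orders:
    print(" -> ".join(order))
```

Each printed line corresponds to one question order, and hence to one dot per model in Figure 4.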