Evaluating the Deductive Competence of Large Language Models

Spencer M. Seals, Valerie L. Shalin

TL;DR

The paper evaluates deductive reasoning in large language models using the Wason selection task, systematically varying problem content (arbitrary, shuffled, realistic) and problem formatting (classic, front, back, both) across several 7B-scale LLMs. It employs Domain Conditional PMI to quantify reasoning performance while applying mixed-effects modeling to control for stimulus variance. Results show only modest gains for realistic social content and no format-driven uplift to human-level performance, with numerous unexpected content-format interactions that diverge from human patterns and persist across models. The study concludes that LLM reasoning exhibits biases rooted in training data and that the Wason task remains a diagnostic tool for uncovering fundamental limitations in AI deductive capabilities, advocating cautious interpretation of reasoning improvements and the need for targeted evaluation designs.
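As a rough illustration of the scoring idea, Domain Conditional PMI is commonly defined as the log-ratio between an answer's probability given the full prompt and its probability given only a short domain premise. The sketch below assumes that general definition; the candidate answers and probabilities are hypothetical placeholders, not values or stimuli from the paper, and the paper's exact prompts and domain premise are not given here.

```python
import math

def dcpmi(logp_given_prompt: float, logp_given_domain: float) -> float:
    """Log-ratio of P(answer | full prompt) to P(answer | domain premise only)."""
    return logp_given_prompt - logp_given_domain

# Hypothetical log-probabilities: (log P(answer | prompt), log P(answer | domain)).
# In practice these would come from a language model; here they are made up.
candidates = {
    "A and 7": (math.log(0.10), math.log(0.02)),
    "A and 4": (math.log(0.30), math.log(0.25)),
}

# Score every candidate and pick the one with the highest DCPMI.
scores = {answer: dcpmi(lp, ld) for answer, (lp, ld) in candidates.items()}
best_by_dcpmi = max(scores, key=scores.get)

# For comparison: the answer with the highest raw conditional probability.
best_by_raw = max(candidates, key=lambda a: candidates[a][0])
```

In this toy case the raw probability favors "A and 4", which is already likely under the domain premise alone, while DCPMI favors "A and 7", whose probability rises most once the full prompt is seen. Discounting answers that are frequent regardless of the prompt is the motivation for this kind of normalization.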

Abstract

The development of highly fluent large language models (LLMs) has prompted increased interest in assessing their reasoning and problem-solving capabilities. We investigate whether several LLMs can solve a classic type of deductive reasoning problem from the cognitive science literature. The tested LLMs have limited abilities to solve these problems in their conventional form. We performed follow-up experiments to investigate whether changes to the presentation format and content improve model performance. We do find performance differences between conditions; however, they do not improve overall performance. Moreover, we find that performance interacts with presentation format and content in unexpected ways that differ from human performance. Overall, our results suggest that LLMs have unique reasoning biases that are only partially predicted from human reasoning performance and the human-generated language corpora that inform them.
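For readers unfamiliar with the Wason selection task named in the TL;DR, the logic of the correct answer can be sketched as follows. The example assumes the textbook version of the task (the rule "if a card has a vowel on one side, it has an even number on the other", with cards A, K, 4, 7); it is not drawn from the paper's stimuli, and the function name is hypothetical.

```python
def must_turn(visible: str) -> bool:
    """Return True if the card could falsify the rule 'vowel -> even number'.

    Only two kinds of card can reveal a violation:
    a visible vowel (P), which might hide an odd number, and
    a visible odd number (not-Q), which might hide a vowel.
    """
    if visible.isalpha():
        return visible.lower() in "aeiou"  # P is showing
    return int(visible) % 2 == 1           # not-Q is showing

# Textbook card set: P, not-P, Q, not-Q.
cards = ["A", "K", "4", "7"]
selection = [card for card in cards if must_turn(card)]
```

Here `selection` contains the vowel and the odd number. The even number (Q) is a common but incorrect choice, since nothing on its reverse side can violate a conditional rule.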

Paper Structure (30 sections, 1 equation, 7 figures, 9 tables)

Figures (7)

  • Figure 1: Model performance by content type for arbitrary (AR), shuffled (SH), and realistic (RE) rules. RE contains both social and non-social rules. Error bars represent 95% confidence intervals. We find no effects of LLM or familiarity, so performance is collapsed across both. Relative to arbitrary content, most models show a benefit for realistic rules, while shuffling has mixed effects.
  • Figure 2: Interaction between content type and social rule status for Analysis 1. Content type: arbitrary (AR), shuffled (SH), or realistic (RE) rules. The realistic category contains social rules and non-social rules. Social rule status: social rule (SR) or non-social rule (NSR) problems. We find no effects of LLM or familiarity, so performance is collapsed across both.
  • Figure 3: Performance across all models for arbitrary (AR), shuffled (SH), and realistic (RE) rules. The realistic category contains both social rules and non-social rules. We collapse across LLM and familiarity.
  • Figure 4: Interaction between presentation format (classic, front, back, or both), content type (shuffled (SH) or realistic (RE)), and social rule status (social rule or non-social rule), broken out by presentation format.
  • Figure 5: Interaction between presentation format (classic, front, back, or both), content type (shuffled (SH) or realistic (RE)), and social rule status (social rule or non-social rule), broken out by content type.
  • ...and 2 more figures