Evaluating the Deductive Competence of Large Language Models
Spencer M. Seals, Valerie L. Shalin
TL;DR
The paper evaluates deductive reasoning in large language models using the Wason selection task, systematically varying problem content (arbitrary, shuffled, realistic) and problem format (classic, front, back, both) across several 7B-scale LLMs. Answers are scored with Domain Conditional PMI, and mixed-effects models control for stimulus variance. Results show only modest gains for realistic social content, and no formatting change lifts performance to human levels; numerous unexpected content-format interactions diverge from human patterns and persist across models. The study concludes that LLM reasoning exhibits biases rooted in training data and that the Wason task remains a useful diagnostic for fundamental limitations in LLM deductive reasoning. It advocates cautious interpretation of reported reasoning improvements and calls for targeted evaluation designs.
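As a rough illustration of the scoring idea, the sketch below assumes the usual definition of Domain Conditional PMI, in which an answer's log-probability given the full prompt is offset by its log-probability given only a short domain premise. The function names, option labels, and probabilities are hypothetical and are not taken from the paper.

```python
import math

def dcpmi_scores(logp_given_prompt, logp_given_domain):
    """Return log P(y | prompt) - log P(y | domain premise) for each option y."""
    return {
        option: logp_given_prompt[option] - logp_given_domain[option]
        for option in logp_given_prompt
    }

def pick_answer(logp_given_prompt, logp_given_domain):
    """Choose the option with the highest Domain Conditional PMI score."""
    scores = dcpmi_scores(logp_given_prompt, logp_given_domain)
    return max(scores, key=scores.get)

# Hypothetical log-probabilities for two answer options (illustrative only).
LOGP_PROMPT = {"option_1": math.log(0.5), "option_2": math.log(0.3)}
LOGP_DOMAIN = {"option_1": math.log(0.6), "option_2": math.log(0.1)}

raw_choice = max(LOGP_PROMPT, key=LOGP_PROMPT.get)   # highest raw probability
dcpmi_choice = pick_answer(LOGP_PROMPT, LOGP_DOMAIN)  # highest DCPMI score
print(raw_choice, dcpmi_choice)
```

In this toy case the raw probability and the DCPMI score disagree, because the first option is already likely under the domain premise alone. That is the kind of surface-level preference the correction is meant to discount.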
Abstract
The development of highly fluent large language models (LLMs) has prompted increased interest in assessing their reasoning and problem-solving capabilities. We investigate whether several LLMs can solve a classic type of deductive reasoning problem from the cognitive science literature. The tested LLMs have limited abilities to solve these problems in their conventional form. We performed follow-up experiments to investigate whether changes to the presentation format and content improve model performance. We do find performance differences between conditions; however, they do not improve overall performance. Moreover, we find that performance interacts with presentation format and content in unexpected ways that differ from human performance. Overall, our results suggest that LLMs have unique reasoning biases that are only partially predicted from human reasoning performance and the human-generated language corpora that inform them.
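For readers unfamiliar with the problem type, the following minimal sketch works through the logic of the Wason selection task using its textbook vowel/even-number version. The rule and cards are the standard illustration, not the paper's stimuli, and the `Card` type and helper names are hypothetical. Given a rule "if P then Q", the only cards worth turning over are those whose hidden side could reveal P true and Q false.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Card:
    label: str
    p: Optional[bool]  # visible truth value of the antecedent P, None if hidden
    q: Optional[bool]  # visible truth value of the consequent Q, None if hidden

def can_falsify(card: Card) -> bool:
    """A card can expose a violation only if it may turn out to be P and not-Q."""
    return card.p is not False and card.q is not True

def cards_to_turn(cards: List[Card]) -> List[str]:
    """Labels of the cards that must be checked to test 'if P then Q'."""
    return [card.label for card in cards if can_falsify(card)]

# Rule: "If a card has a vowel on one side, it has an even number on the other."
CLASSIC_CARDS = [
    Card("A", p=True, q=None),   # vowel showing: the number side is hidden
    Card("K", p=False, q=None),  # consonant showing: the rule says nothing about it
    Card("4", p=None, q=True),   # even number showing: cannot violate the rule
    Card("7", p=None, q=False),  # odd number showing: a vowel behind it would violate
]

print(cards_to_turn(CLASSIC_CARDS))
```

The logically correct selection is the P card and the not-Q card.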
