Theory-Grounded Evaluation of Human-Like Fallacy Patterns in LLM Reasoning
Andrew Keenan Richardson, Ryan Othniel Kearns, Sean Moss, Vincent Wang-Mascianica, Philipp Koralus
TL;DR
This study asks whether LLM errors on controlled reasoning tasks follow human fallacy patterns. It applies the Erotetic Theory of Reasoning (ETR), via the PyETR toolkit, to generate 383 problems and evaluates 38 models on them. It finds a significant positive association between a capability proxy (Chatbot Arena Elo) and the proportion of incorrect answers that are ETR-predicted fallacies, and it observes human-like premise-order effects in LLM responses. The authors provide the first substantial open-source application of PyETR to LLMs, creating a contamination-resistant, synthetic reasoning testbed that focuses on error composition rather than raw error rate. The findings suggest that increased capability may shift error types toward human-like fallacies, highlighting a need for theory-grounded benchmarks and targeted interventions to improve reasoning reliability in high-capability models.
Abstract
We study logical reasoning in language models by asking whether their errors follow established human fallacy patterns. Using the Erotetic Theory of Reasoning (ETR) and its open-source implementation, PyETR, we programmatically generate 383 formally specified reasoning problems and evaluate 38 models. For each response, we judge logical correctness and, when incorrect, whether it matches an ETR-predicted fallacy. Two results stand out: (i) as a capability proxy (Chatbot Arena Elo) increases, a larger share of a model's incorrect answers are ETR-predicted fallacies ($\rho = 0.360$, $p = 0.0265$), while overall correctness on this dataset shows no correlation with capability; (ii) reversing premise order significantly reduces fallacy production for many models, mirroring human order effects. Methodologically, PyETR provides an open-source pipeline for unbounded, synthetic, contamination-resistant reasoning tests linked to a cognitive theory, enabling analyses that focus on error composition rather than error rate.
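
To make the distinction between error composition and error rate concrete, the following is a minimal sketch of how the two quantities in result (i) could be computed. It is not the authors' pipeline and does not use PyETR. The `Response` record, the helper names, and the toy data are all hypothetical, and treating $\rho$ as a Spearman rank correlation is an assumption, since the abstract does not name the statistic.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Response:
    """Hypothetical record of one judged model answer."""
    correct: bool            # was the conclusion logically correct?
    predicted_fallacy: bool  # if incorrect, does it match the ETR-predicted fallacy?


def fallacy_share(responses: Sequence[Response]) -> Optional[float]:
    """Share of INCORRECT answers that are predicted fallacies (error composition).

    Returns None when the model made no errors, since the share is undefined.
    """
    incorrect = [r for r in responses if not r.correct]
    if not incorrect:
        return None
    return sum(r.predicted_fallacy for r in incorrect) / len(incorrect)


def average_ranks(values: Sequence[float]) -> list:
    """1-based ranks, with tied values receiving the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy, made-up data: (capability proxy, judged responses) per hypothetical model.
toy_models = [
    (1000.0, [Response(False, False), Response(False, False), Response(True, False)]),
    (1100.0, [Response(False, True), Response(False, False), Response(True, False)]),
    (1200.0, [Response(False, True), Response(False, True), Response(True, False)]),
]
elos = [elo for elo, _ in toy_models]
shares = [fallacy_share(rs) for _, rs in toy_models]
rho = spearman(elos, shares)
```

In this toy data every model has the same error rate (two of three answers wrong), yet the fallacy share rises with the capability proxy. That is the kind of pattern an error-composition analysis can detect and an error-rate analysis cannot.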
