Categorical Syllogisms Revisited: A Review of the Logical Reasoning Abilities of LLMs for Analyzing Categorical Syllogism
Shi Zong, Jimmy Lin
TL;DR
This work systematically reviews how LLMs perform on categorical syllogism reasoning, analyzing diverse datasets and the coverage of mood/figure configurations. It identifies quantifier interpretation as a key bottleneck and shows that crowd-generated datasets skew toward a limited set of configurations, while template-based datasets offer broader variation. The authors propose a two-pronged path forward: improve dataset design (clear existential import, complete annotations, ordinary-argument samples) and explore both external symbolic reasoning and internal model enhancements to boost reliability. By combining logician-level analyses with NLP methodology, the study motivates interdisciplinary collaboration to build fairer, more informative benchmarks for logical inference in LLMs.
Abstract
Many benchmarks have been proposed to evaluate how large language models (LLMs) perform on logical inference tasks. However, how to properly evaluate this ability remains an open question. In this paper, we provide a systematic overview of prior work on the logical reasoning ability of LLMs in analyzing categorical syllogisms. We first investigate all possible variations of categorical syllogisms from a purely logical perspective and then examine the underlying configurations (i.e., mood and figure) tested by existing datasets. Our results indicate that, compared to template-based synthetic datasets, crowdsourcing approaches typically sacrifice coverage of these configurations in exchange for more language variation, which makes it harder to fully test LLMs across different situations. We then summarize the findings and observations from the current literature on how well LLMs infer the validity of syllogisms. Error rate breakdown analyses suggest that the interpretation of quantifiers is the current bottleneck limiting LLM performance and thus deserves more attention. Finally, we discuss several points worth considering when researchers plan future releases of categorical syllogism datasets. We hope our work will not only provide a timely review of the current literature on categorical syllogisms, but also motivate more interdisciplinary research between communities, specifically computational linguists and logicians.
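To make the notion of mood and figure configurations concrete, the following is a minimal illustrative sketch in Python; it is not the authors' code. It enumerates every configuration (three statements of type A, E, I, or O, in one of four figures) and checks validity by brute force over the inhabited regions of a three-circle Venn diagram. All names (`FORMS`, `holds`, `is_valid`, `existential_import`) are hypothetical, and the `existential_import` switch assumes the traditional reading in which all three terms are non-empty.

```python
from itertools import product

TERMS = ("S", "M", "P")  # minor term, middle term, major term

# Each Venn-diagram region is a membership triple (in_S, in_M, in_P).
# The region outside all three circles never affects the statements below.
REGIONS = [r for r in product([False, True], repeat=3) if any(r)]

# Figure -> ((subject, predicate) of major premise, same for minor premise).
FIGURES = {
    1: (("M", "P"), ("S", "M")),
    2: (("P", "M"), ("S", "M")),
    3: (("M", "P"), ("M", "S")),
    4: (("P", "M"), ("M", "S")),
}

def holds(kind, subj, pred, inhabited):
    """Evaluate an A/E/I/O statement given the list of inhabited regions."""
    i, j = TERMS.index(subj), TERMS.index(pred)
    both = any(r[i] and r[j] for r in inhabited)           # some subj is pred
    subj_only = any(r[i] and not r[j] for r in inhabited)  # some subj is not pred
    return {"A": not subj_only, "E": not both, "I": both, "O": subj_only}[kind]

def is_valid(mood, figure, existential_import=False):
    """Valid iff no model makes both premises true and the conclusion false."""
    major, minor = FIGURES[figure]
    for flags in product([False, True], repeat=len(REGIONS)):
        inhabited = [r for r, f in zip(REGIONS, flags) if f]
        # Assumption: existential import means every term is non-empty.
        if existential_import and not all(
            any(r[k] for r in inhabited) for k in range(3)
        ):
            continue
        if (
            holds(mood[0], *major, inhabited)
            and holds(mood[1], *minor, inhabited)
            and not holds(mood[2], "S", "P", inhabited)
        ):
            return False  # found a counter-model
    return True

# All mood/figure configurations: 4^3 moods times 4 figures.
FORMS = [("".join(m), f) for m in product("AEIO", repeat=3) for f in FIGURES]

if __name__ == "__main__":
    print(len(FORMS), "configurations")
    print("valid, no existential import:", sum(is_valid(m, f) for m, f in FORMS))
    print("valid, existential import:", sum(is_valid(m, f, True) for m, f in FORMS))
```

Under these assumptions the enumeration yields 256 configurations, and the two validity counts should match the textbook figures of 15 forms valid without existential import and 24 with it. The gap between the two readings is one reason a dataset needs to state its existential-import convention explicitly: a form such as AAI-1 is labeled valid under one convention and invalid under the other.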
