Categorical Syllogisms Revisited: A Review of the Logical Reasoning Abilities of LLMs for Analyzing Categorical Syllogism

Shi Zong, Jimmy Lin

TL;DR

This work systematically reviews how LLMs perform on categorical syllogism reasoning, analyzing diverse datasets and the coverage of mood/figure configurations. It identifies quantifier interpretation as a key bottleneck and shows that crowd-generated datasets skew toward a limited set of configurations, while template-based datasets offer broader variation. The authors propose a two-pronged path forward: improve dataset design (clear existential import, complete annotations, ordinary-argument samples) and explore both external symbolic reasoning and internal model enhancements to boost reliability. By combining logician-level analyses with NLP methodology, the study motivates interdisciplinary collaboration to build fairer, more informative benchmarks for logical inference in LLMs.

Abstract

A large number of benchmarks have been proposed to evaluate how large language models (LLMs) behave on logical inference tasks. However, how to properly evaluate this ability remains an open question. In this paper, we provide a systematic overview of prior work on the logical reasoning ability of LLMs in analyzing categorical syllogisms. We first investigate all possible variations of categorical syllogisms from a purely logical perspective and then examine the underlying configurations (i.e., mood and figure) tested by existing datasets. Our results indicate that, compared to template-based synthetic datasets, crowdsourcing approaches normally sacrifice coverage of these configurations for more language variation, which makes it challenging to fully test LLMs under different situations. We then summarize the findings and observations from the current literature on how well LLMs infer the validity of syllogisms. The error rate breakdown analyses suggest that the interpretation of quantifiers appears to be the current bottleneck limiting LLM performance and is thus worth more attention. Finally, we discuss several points worth considering when researchers plan future releases of categorical syllogism datasets. We hope our work will not only provide a timely review of the current literature on categorical syllogisms, but also motivate more interdisciplinary research between communities, specifically computational linguists and logicians.
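
The configuration space mentioned in the abstract can be made concrete with a short sketch. It assumes the standard textbook definitions: four proposition types (A, E, I, O) for each of the major premise, minor premise, and conclusion, combined with four figures that fix where the middle term M sits. The helper names `configurations` and `render`, and the English templates, are illustrative assumptions and not taken from the paper or any dataset.

```python
from itertools import product

# The four categorical proposition types, as plain English templates.
TEMPLATES = {
    "A": "All {s} are {p}.",
    "E": "No {s} are {p}.",
    "I": "Some {s} are {p}.",
    "O": "Some {s} are not {p}.",
}

# Each figure fixes the (subject, predicate) order in the major and minor
# premises. S = minor term, P = major term, M = middle term.
FIGURES = {
    1: (("M", "P"), ("S", "M")),
    2: (("P", "M"), ("S", "M")),
    3: (("M", "P"), ("M", "S")),
    4: (("P", "M"), ("M", "S")),
}


def configurations():
    """Yield every (mood, figure) pair, e.g. ("AAA", 1)."""
    for mood, figure in product(product("AEIO", repeat=3), FIGURES):
        yield "".join(mood), figure


def render(mood, figure, terms):
    """Fill the templates for one configuration with concrete terms."""
    (a, b), (c, d) = FIGURES[figure]
    return [
        TEMPLATES[mood[0]].format(s=terms[a], p=terms[b]),      # major premise
        TEMPLATES[mood[1]].format(s=terms[c], p=terms[d]),      # minor premise
        TEMPLATES[mood[2]].format(s=terms["S"], p=terms["P"]),  # conclusion
    ]


configs = list(configurations())
print(len(configs))  # 256 = 4 * 4 * 4 moods times 4 figures
for line in render("AAA", 1, {"S": "Greeks", "M": "humans", "P": "mortals"}):
    print(line)
```

A template-based generator of this kind reaches every configuration by construction, which is the coverage property the abstract contrasts with crowdsourced data.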

Paper Structure (45 sections, 2 figures, 6 tables)

Figures (2)

  • Figure 1: Error rate ($\downarrow$) of GPT-4 and GPT-4o using zero-shot chain-of-thought prompting. (a) and (b): Breakdowns over all 256 configurations of categorical syllogisms in the Reasoning dataset, calculated over 10 different combinations. A white block indicates an error rate of 0 (i.e., 100% accuracy) for that configuration. (c): Breakdowns by configuration in the SylloFigure and Avicenna datasets. We mark the predicted configuration as "N/A" if it does not pass the cross-check discussed in the paper's tools section.
  • Figure 2: Percentage breakdowns of the correct propositions within each predicted proposition type (by GPT-4). 156 propositions (last row) could not be classified, and we cannot automatically verify the correctness of their predictions without human effort (last column).
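
Which of the 256 configurations in Figure 1 count as valid depends on how existential import is handled, which is one reason the summary above asks datasets to state it clearly. The sketch below is a brute-force check under stated assumptions, not the paper's tooling: it treats each of the eight Venn regions over S, M, and P as either inhabited or empty, and calls a configuration valid if no such model makes both premises true and the conclusion false. The function names and the `existential_import` flag are hypothetical.

```python
from itertools import product

# Term order per figure: (major premise, minor premise) as (subject, predicate).
FIGURES = {
    1: (("M", "P"), ("S", "M")),
    2: (("P", "M"), ("S", "M")),
    3: (("M", "P"), ("M", "S")),
    4: (("P", "M"), ("M", "S")),
}
IDX = {"S": 0, "M": 1, "P": 2}
# A region is a membership triple (in S, in M, in P); there are 8 of them.
REGIONS = list(product([False, True], repeat=3))


def holds(kind, subj, pred, model):
    """Truth of one proposition in a model (a list of inhabited regions)."""
    i, j = IDX[subj], IDX[pred]
    both = any(r[i] and r[j] for r in model)
    subj_only = any(r[i] and not r[j] for r in model)
    return {"A": not subj_only, "E": not both, "I": both, "O": subj_only}[kind]


def models(existential_import):
    """All 2**8 models; with import, keep only those where S, M, P are non-empty."""
    for flags in product([False, True], repeat=len(REGIONS)):
        model = [r for r, on in zip(REGIONS, flags) if on]
        if existential_import and not all(any(r[k] for r in model) for k in range(3)):
            continue
        yield model


def is_valid(mood, figure, existential_import):
    """Valid iff no model satisfies both premises while falsifying the conclusion."""
    (a, b), (c, d) = FIGURES[figure]
    return not any(
        holds(mood[0], a, b, m) and holds(mood[1], c, d, m)
        and not holds(mood[2], "S", "P", m)
        for m in models(existential_import)
    )


def count_valid(existential_import):
    moods = ("".join(m) for m in product("AEIO", repeat=3))
    return sum(is_valid(mood, fig, existential_import)
               for mood in moods for fig in FIGURES)


print(count_valid(False), count_valid(True))
```

The standard textbook counts are 15 valid forms without existential import and 24 when all three terms are assumed non-empty; forms such as AAI-1 fall in the gap between the two, so a label for them is only meaningful once the dataset fixes the reading.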