Socrates or Smartypants: Testing Logic Reasoning Capabilities of Large Language Models with Logic Programming-based Test Oracles

Zihao Xu, Junchen Ding, Yiling Lou, Kun Zhang, Dong Gong, Yuekang Li

TL;DR

SmartyPat introduces an automated, Prolog-based framework to generate and evaluate logically fallacious statements, producing SmartyPat-Bench (502 real-world examples) and SmartyPat-Bench-Augmented (synthetic, yet high-quality data). By coupling logic-programming test oracles with neural generation, the approach yields controllable, diverse fallacies and enables rigorous evaluation of LLMs on fallacy existence and categorization. Across nine state-of-the-art LLMs, the study finds that excessive reasoning steps hinder fallacy detection while structured reasoning enhances categorization, with non-reasoning models excelling at detection and reasoning models performing better at categorization. The work provides a practical, extensible benchmark and demonstrates the value of neurosymbolic generation for trustworthy LLM evaluation, with implications for testability and reliability in real-world reasoning tasks.

Abstract

Large Language Models (LLMs) have achieved significant progress in language understanding and reasoning. Evaluating and analyzing their logical reasoning abilities has therefore become essential. However, existing datasets and benchmarks are often limited to overly simplistic, unnatural, or contextually constrained examples. In response to the growing demand, we introduce SmartyPat-Bench, a challenging, naturally expressed, and systematically labeled benchmark derived from real-world high-quality Reddit posts containing subtle logical fallacies. Unlike existing datasets and benchmarks, it provides more detailed annotations of logical fallacies and features more diverse data. To further scale up the study and address the limitations of manual data collection and labeling - such as fallacy-type imbalance and labor-intensive annotation - we introduce SmartyPat, an automated framework powered by logic programming-based oracles. SmartyPat utilizes Prolog rules to systematically generate logically fallacious statements, which are then refined into fluent natural-language sentences by LLMs, ensuring precise fallacy representation. Extensive evaluation demonstrates that SmartyPat produces fallacies comparable in subtlety and quality to human-generated content and significantly outperforms baseline methods. Finally, experiments reveal nuanced insights into LLM capabilities, highlighting that while excessive reasoning steps hinder fallacy detection accuracy, structured reasoning enhances fallacy categorization performance.
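The abstract's idea of a logic-programming test oracle can be illustrated with a small sketch. The paper uses Prolog rules; the Python below is only an analogy, not the authors' implementation. The fallacy chosen here (affirming the consequent), the `Rule` type, and every function name are hypothetical and introduced purely for illustration. The sketch builds an argument skeleton from a rule, then checks by forward chaining that the claimed conclusion is not actually entailed, which is what lets the label be assigned mechanically rather than by hand.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """A single implication: premise -> conclusion (hypothetical helper type)."""
    premise: str
    conclusion: str


def entailed(facts: set[str], rules: list[Rule]) -> set[str]:
    """Forward-chain over the rules until no new fact can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule.premise in known and rule.conclusion not in known:
                known.add(rule.conclusion)
                changed = True
    return known


def affirming_the_consequent(rule: Rule) -> dict:
    """Build a skeleton that asserts the consequent and concludes the premise."""
    return {
        "label": "affirming_the_consequent",  # illustrative label, not from the paper
        "rule": rule,
        "given": rule.conclusion,
        "claimed": rule.premise,
    }


def oracle_is_fallacious(case: dict) -> bool:
    """The claim is fallacious if it does not follow from the rule plus the given fact."""
    derived = entailed({case["given"]}, [case["rule"]])
    return case["claimed"] not in derived


rule = Rule(premise="it_rained", conclusion="street_is_wet")
case = affirming_the_consequent(rule)

# A rough skeleton that a language model could later rewrite into fluent prose.
skeleton = (
    f"If {rule.premise} then {rule.conclusion}. "
    f"{case['given']}. Therefore {case['claimed']}."
)

# A valid modus ponens argument over the same rule, for contrast.
valid_case = {"rule": rule, "given": "it_rained", "claimed": "street_is_wet"}
```

Here `oracle_is_fallacious(case)` holds for the generated skeleton and fails for `valid_case`, so the symbolic check, rather than a human annotator, decides whether a generated statement carries the intended fallacy.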

Paper Structure

This paper contains 47 sections, 20 equations, 8 figures, 13 tables, 1 algorithm.

Figures (8)

  • Figure 1: The score distribution of the three methods across different types of logical fallacies. *Direct means FallacyGen-Direct; Prolog means FallacyGen-Prolog. A larger share of score 3 is better.
  • Figure 2: F1 score (higher is better), sorted in descending order. Claude 3.7 Ex means Claude 3.7 Extended Thinking.
  • Figure 3: LLM accuracy (darker is better) in identifying fallacies from SmartyPat-Bench (without *) and SmartyPat-Bench-Augmented (with *).
  • Figure 4: Fallacy label scores for selected LLMs, sorted in descending order (closer to the left is better).
  • Figure 5: Examples of test cases in different benchmarks. *Each grey block is a single test case.
  • ...and 3 more figures

Theorems & Definitions (14)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6
  • Definition 7
  • Definition 8
  • Definition 9
  • Definition 10
  • ...and 4 more