A Systematic Analysis of Large Language Models as Soft Reasoners: The Case of Syllogistic Inferences

Leonardo Bertolazzi, Albert Gatt, Raffaella Bernardi

TL;DR

The results suggest that the behavior of pre-trained LLMs can be explained by heuristics studied in cognitive science. Both ICL and SFT improve model performance on valid inferences, but only SFT mitigates most reasoning biases without harming model consistency.

Abstract

The reasoning abilities of Large Language Models (LLMs) are becoming a central focus of study in NLP. In this paper, we consider the case of syllogistic reasoning, an area of deductive reasoning studied extensively in logic and cognitive psychology. Previous research has shown that pre-trained LLMs exhibit reasoning biases, such as $\textit{content effects}$, avoid answering that $\textit{no conclusion follows}$, display human-like difficulties, and struggle with multi-step reasoning. We contribute to this research line by systematically investigating the effects of chain-of-thought reasoning, in-context learning (ICL), and supervised fine-tuning (SFT) on syllogistic reasoning, considering syllogisms with conclusions that support or violate world knowledge, as well as ones with multiple premises. Crucially, we go beyond the standard focus on accuracy, with an in-depth analysis of the conclusions generated by the models. Our results suggest that the behavior of pre-trained LLMs can be explained by heuristics studied in cognitive science and that both ICL and SFT improve model performance on valid inferences, although only the latter mitigates most reasoning biases without harming model consistency.

Paper Structure (42 sections, 12 figures, 11 tables)

This paper contains 42 sections, 12 figures, 11 tables.

Figures (12)

  • Figure 1: LLMs have difficulty with invalid inferences (Top); suffer from content effects (Middle); and struggle with longer chains of premises (Bottom). What is behind such weaknesses? Can LLMs learn to use only the form to draw deductively valid conclusions?
  • Figure 2: The building blocks: Moods (A, E, I, O) and figures (1-4). Their combination determines the conclusion, as illustrated by the AE2 schema.
  • Figure 3: Multiple-choice Task. The model is given the premises and nine possible conclusions, and has to generate the correct one(s). $\textrm{ICL}_{out}$ is given in-context examples of different schemas than the one of the test example, while $\textrm{ICL}_{in}$ receives in-context examples of the same schema. The supervised fine-tuned model is trained on all schemas.
  • Figure 4: Triples of Terms. Ten triples of terms are used to create believable and unbelievable syllogisms. Each term represents a class of entities and the terms within each triple instantiate a hierarchy of increasing generality, from more specific to broader categories.
  • Figure 5: Task instruction prompt. We adapted our prompt from Eisape et al. (2023) and added the strings “Read the passage of information thoroughly and select the correct answer from the available options. Read the premises thoroughly to ensure you know what the premise entails.” to make the task requirements more explicit.
  • ...and 7 more figures
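
The captions of Figures 2 and 3 describe how a mood pair and a figure determine a schema such as AE2, and how the model chooses among nine candidate conclusions. The following is a minimal Python sketch of that construction, not the authors' code. It assumes the figure convention common in the psychology literature (end terms a and c, middle term b), and it assumes the nine options are the four moods in both end-term orders plus a "nothing follows" option. The function names, the sentence templates, and the example terms are hypothetical.

```python
from itertools import product

# Sentence templates for the four classical moods.
MOODS = {
    "A": "All {x} are {y}",
    "E": "No {x} are {y}",
    "I": "Some {x} are {y}",
    "O": "Some {x} are not {y}",
}

# Term order of the two premises for each figure.
# ASSUMPTION: a and c are the end terms, b is the middle term,
# and figures are numbered as in the psychology literature.
FIGURES = {
    1: (("a", "b"), ("b", "c")),
    2: (("b", "a"), ("c", "b")),
    3: (("a", "b"), ("c", "b")),
    4: (("b", "a"), ("b", "c")),
}


def premises(schema, terms):
    """Build the two premises for a schema string such as 'AE2'."""
    mood1, mood2, figure = schema[0], schema[1], int(schema[2])
    order1, order2 = FIGURES[figure]
    return [
        MOODS[mood1].format(x=terms[order1[0]], y=terms[order1[1]]),
        MOODS[mood2].format(x=terms[order2[0]], y=terms[order2[1]]),
    ]


def candidate_conclusions(terms):
    """List the candidate conclusions: 4 moods x 2 term orders, plus one."""
    options = [
        MOODS[mood].format(x=terms[x], y=terms[y])
        for mood, (x, y) in product("AEIO", (("a", "c"), ("c", "a")))
    ]
    options.append("Nothing follows")  # ASSUMPTION: wording of the ninth option
    return options


# Hypothetical triple of terms, used only for illustration.
terms = {"a": "labradors", "b": "dogs", "c": "canines"}
print(premises("AE2", terms))
print(len(candidate_conclusions(terms)))
```

Under these assumptions, the AE2 schema yields the premises "All dogs are labradors" and "No canines are dogs", and the option list has nine entries. Keeping the templates separate from the term triple mirrors the idea that validity depends only on the form, so swapping in a different triple changes believability without changing the schema.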