Comparing Inferential Strategies of Humans and Large Language Models in Deductive Reasoning

Philipp Mondorf; Barbara Plank

TL;DR

The paper investigates whether large language models (LLMs) employ inferential strategies in propositional logic that resemble human reasoning. Using a cross-model evaluation across Zephyr-7B-$\beta$, Mistral-7B-Instruct, and LLaMA-2 variants with zero-shot chain-of-thought prompts, the authors analyze both the strategies used and the logical soundness of the reasoning through manual annotation. They find that LLMs adopt human-like strategies such as supposition following and chain construction, with model-family-dependent preferences and only a moderate link between final correctness and reasoning quality. The work highlights that final answer accuracy alone is insufficient to judge reasoning capabilities and calls for more nuanced evaluation frameworks and faithfulness checks, while outlining directions for larger datasets and automatic strategy classification to advance understanding of LLM reasoning.

Abstract

Deductive reasoning plays a pivotal role in the formulation of sound and cohesive arguments. It allows individuals to draw conclusions that logically follow, given the truth value of the information provided. Recent progress in the domain of large language models (LLMs) has showcased their capability in executing deductive reasoning tasks. Nonetheless, a significant portion of research primarily assesses the accuracy of LLMs in solving such tasks, often overlooking a deeper analysis of their reasoning behavior. In this study, we draw upon principles from cognitive psychology to examine inferential strategies employed by LLMs, through a detailed evaluation of their responses to propositional logic problems. Our findings indicate that LLMs display reasoning patterns akin to those observed in humans, including strategies like $\textit{supposition following}$ or $\textit{chain construction}$. Moreover, our research demonstrates that the architecture and scale of the model significantly affect its preferred method of reasoning, with more advanced models tending to adopt strategies more frequently than less sophisticated ones. Importantly, we assert that a model's accuracy, that is the correctness of its final conclusion, does not necessarily reflect the validity of its reasoning process. This distinction underscores the necessity for more nuanced evaluation procedures in the field.

Paper Structure (21 sections, 19 figures, 4 tables)

This paper contains 21 sections, 19 figures, 4 tables.

Introduction
Strategies in Propositional Reasoning
Experimental Setup
Results and Analysis
Quantitative Analysis
Qualitative Analysis
Related Work
Conclusion
Limitations
Additional Experimental Details
Task Prompts
Annotator Instructions
Inter-Annotator Agreement
Model Details
Additional Quantitative Results
...and 6 more sections

Figures (19)

Figure 1: Given the propositional reasoning prompt (top box), the LLM shows two different inferential strategies: supposition following (left) and chain construction (right); see the section "Strategies in Propositional Reasoning" for strategy details. Note that both answers are only partially correct, as the exclusive disjunction has only been proven for one of the cases (pink and not black). Model responses are generated by LLaMA-2-Chat-70B across two random seeds.
Figure 2: An example for each of the five inferential strategies identified by Van der Henst et al. (2002) (to the left of the dashed vertical line) that human reasoners employ when solving tasks of propositional logic. Each strategy is illustrated by a single example adopted from the transcribed recordings published by the original study. In addition, we provide an example of the symbolic strategy occasionally encountered in LLMs (to the right of the dashed line). "Iff" denotes a biconditional, while "xor" indicates an exclusive disjunction.
Figure 3: The response (lower left box) of LLaMA-2-70B to one problem (top box) of the problem set, demonstrating chain construction. The model correctly constructs a chain of conditionals (highlighted in yellow within the model's response) based on the premises, leading from the antecedent of the final conclusion to its consequent. Comments made by the annotators are presented in the adjacent right panel.
Figure 4: Instances where models generate sound reasoning traces that logically follow from the problem statement. For each inferential strategy, the ratio of sound reasoning traces (represented by the filled portion) to the overall application of that strategy (denoted by the unfilled bar) is depicted. Ratios are expressed as percentages above the corresponding filled section. Note that LLaMA-2-7B is not displayed as it does not exhibit sound reasoning.
Figure 5: The task prompt (upper yellow box) as well as statements and conclusion for each propositional logic problem (lower gray boxes). In the task prompt, the placeholder "<statements and conclusion from below>" is replaced with the actual statements and conclusion relevant to each problem. To enhance readability, we employ abbreviations within the problem statements. In the actual prompt, "colorA iff colorB" is replaced by "There is a colorA marble in the box if and only if there is a colorB marble in the box". Similarly, "colorA xor colorB" denotes "Either there is a colorA marble in the box or else there is a colorB marble in the box, but not both". Lastly, "If colorA then colorB" stands for "If there is a colorA marble in the box then there is a colorB marble in the box".
...and 14 more figures
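
The connectives abbreviated in the Figure 5 caption ("iff", "xor", "if ... then") can be checked mechanically with a truth table. The sketch below is only an illustration of what it means for a conclusion to follow logically from the premises; it is not the authors' evaluation method, which relies on manual annotation. The helper names (`iff`, `xor`, `implies`, `entails`) and the example marble problem are hypothetical and are not taken from the paper's problem set.

```python
from itertools import product


def iff(a, b):
    """Biconditional: true when both sides have the same truth value."""
    return a == b


def xor(a, b):
    """Exclusive disjunction: true when exactly one side is true."""
    return a != b


def implies(a, b):
    """Material conditional: false only when a is true and b is false."""
    return (not a) or b


def entails(variables, premises, conclusion):
    """Return True if every assignment satisfying all premises also satisfies the conclusion."""
    for values in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, values))
        if all(p(world) for p in premises) and not conclusion(world):
            return False  # counterexample: premises hold, conclusion fails
    return True


# Hypothetical problem (NOT from the paper): each variable says whether a
# marble of that color is in the box.
colors = ["white", "blue", "red", "green"]
premises = [
    lambda w: iff(w["white"], w["blue"]),     # white iff blue
    lambda w: xor(w["blue"], w["red"]),       # blue xor red
    lambda w: implies(w["red"], w["green"]),  # if red then green
]


def follows(w):
    return xor(w["white"], w["red"])          # white xor red


def does_not_follow(w):
    return implies(w["green"], w["red"])      # if green then red


print(entails(colors, premises, follows))          # True
print(entails(colors, premises, does_not_follow))  # False
```

A check of this kind only settles whether the final conclusion is valid. It says nothing about whether the intermediate steps of a reasoning trace are sound, which is the distinction the paper draws between accuracy and reasoning quality.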
