Reasoning or a Semblance of it? A Diagnostic Study of Transitive Reasoning in LLMs

Houman Mehrafarin, Arash Eshghi, Ioannis Konstas

TL;DR

This study investigates the transitive reasoning capabilities of two distinct LLM architectures, LLaMA 2 and Flan-T5, by manipulating facts within two compositional datasets, QASC and Bamboogle. Flan-T5 proves more resilient to the manipulations, suggesting that models may develop an understanding of transitivity through fine-tuning on datasets known to be relevant.

Abstract

Evaluating Large Language Models (LLMs) on reasoning benchmarks demonstrates their ability to solve compositional questions. However, little is known about whether these models engage in genuine logical reasoning or simply rely on implicit cues to generate answers. In this paper, we investigate the transitive reasoning capabilities of two distinct LLM architectures, LLaMA 2 and Flan-T5, by manipulating facts within two compositional datasets: QASC and Bamboogle. We controlled for potential cues that might influence the models' performance, including (a) word/phrase overlaps across sections of test input; (b) models' inherent knowledge acquired during pre-training or fine-tuning; and (c) Named Entities. Our findings reveal that while both models leverage (a), Flan-T5 shows more resilience to experiments (b) and (c), having less variance than LLaMA 2. This suggests that models may develop an understanding of transitivity through fine-tuning on datasets known to be relevant, a hypothesis we leave to future work.

Paper Structure

This paper contains 38 sections, 1 equation, 2 figures, 9 tables.

Figures (2)

  • Figure 1: (a) 3-shot In-Context Learning (ICL) prompt for the compositional question answering task. The prompt begins with the instruction "Follow the demonstrations below to answer the given question" followed by 3 demonstrations. Each demonstration consists of a "Context" with a question, optionally a set of multiple-choice (MC) answers for the QASC dataset, two supporting facts (fact 1, fact 2), and a set of "Steps" including a "Deduction" and the correct answer. The test query contains only the "Context", and the LLM needs to generate the "Steps". (b) We perform a series of manipulations on either of the facts (shuffling words, removing overlapping keywords, and replacing Named Entities with gibberish) to control for the different input cues the models might exploit.
  • Figure 2: Accuracy of models prompted with the Shuffled Facts and Full diagnostic prompts. Results show that models are insensitive to word order within facts.
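
The prompt layout and fact manipulations described in the Figure 1 caption can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. Only the instruction string and the "Context", "Steps", and "Deduction" labels come from the caption. The other field labels ("Question:", "Fact 1:", "Answer:"), the helper names, the stopword list, and the toy facts are hypothetical. Named Entities are passed in by hand rather than detected automatically.

```python
import random

# Instruction text quoted in the Figure 1 caption.
INSTRUCTION = "Follow the demonstrations below to answer the given question"

# Hypothetical stopword list; the paper's keyword criterion is not given here.
STOPWORDS = {"a", "an", "the", "of", "in", "is", "are", "to", "and"}


def format_context(item):
    """Render the "Context": question, optional MC answers, and two facts."""
    lines = ["Context:", f"Question: {item['question']}"]
    if item.get("choices"):  # multiple-choice answers are optional (QASC only)
        lines.append("Choices: " + " ".join(f"({k}) {v}" for k, v in item["choices"].items()))
    lines += [f"Fact 1: {item['fact1']}", f"Fact 2: {item['fact2']}"]
    return "\n".join(lines)


def build_prompt(demos, query):
    """Instruction, then demonstrations with "Steps", then the bare test "Context"."""
    parts = [INSTRUCTION]
    for d in demos:
        steps = f"Steps:\nDeduction: {d['deduction']}\nAnswer: {d['answer']}"
        parts.append(format_context(d) + "\n" + steps)
    parts.append(format_context(query) + "\nSteps:")  # the model continues from here
    return "\n\n".join(parts)


def shuffle_words(fact, rng):
    """Shuffled-facts manipulation: permute the words of one fact."""
    words = fact.split()
    rng.shuffle(words)
    return " ".join(words)


def remove_overlap(fact, other):
    """Drop the non-stopword tokens of `fact` that also occur in `other`."""
    shared = {w.lower().strip(".,?") for w in other.split()} - STOPWORDS
    return " ".join(w for w in fact.split() if w.lower().strip(".,?") not in shared)


def gibber_entity(fact, entity, rng):
    """Replace a given Named Entity with a random lowercase string, capitalized."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    gibberish = "".join(rng.choice(letters) for _ in range(len(entity)))
    return fact.replace(entity, gibberish.capitalize())


# Toy data invented for illustration; not taken from QASC or Bamboogle.
demo = {
    "question": "What do beavers block?",
    "fact1": "Beavers build dams.",
    "fact2": "Dams block rivers.",
    "deduction": "Beavers build dams, and dams block rivers.",
    "answer": "rivers",
}
query = {
    "question": "Where was the discoverer of polonium born?",
    "fact1": "Marie Curie discovered polonium.",
    "fact2": "Marie Curie was born in Warsaw.",
}

rng = random.Random(0)
prompt = build_prompt([demo, demo, demo], query)  # 3-shot prompt
shuffled = shuffle_words(query["fact1"], rng)
no_overlap = remove_overlap(demo["fact2"], demo["fact1"])
gibbered = gibber_entity(query["fact2"], "Marie Curie", rng)
```

Each manipulation is applied to one fact of the test query before the prompt is built, so a model's accuracy on the manipulated prompt can be compared with its accuracy on the unmodified one.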