Exploring the Hidden Reasoning Process of Large Language Models by Misleading Them

Guanyu Chen, Peiyang Wang, Yizhou Jiang, Yuqian Liu, Chujie Zhao, Ying Fang, Tianren Zhang, Feng Chen

TL;DR

This work introduces Misleading Fine-Tuning (MisFT) to probe whether large language models rely on abstract rule-based reasoning or on memorization. By fine-tuning models on deliberately contradictory rules for arithmetic and logic and testing generalization to unseen domains (including vision-language scenarios), the authors show that larger models can generalize the misleading rules, suggesting an internal abstraction-then-reasoning mechanism. The results span number overloading, operator overloading, logic overloading, and multimodal arithmetic, and identify deep model layers as key sites for rule mapping. The study offers a data-contamination-free framework for evaluating AI reasoning and provides evidence for genuine abstract reasoning capabilities in LLMs, with implications for model evaluation and interpretability.

Abstract

Large language models (LLMs) have been able to perform various forms of reasoning tasks in a wide range of scenarios, but are they truly engaging in task abstraction and rule-based reasoning beyond mere memorization? To answer this question, we propose a novel experimental approach, Misleading Fine-Tuning (MisFT), to examine whether LLMs perform abstract reasoning by altering their original understanding of fundamental rules. In particular, by constructing datasets with math expressions or logical formulas that contradict correct principles, we fine-tune the model to learn those contradictory rules and assess its generalization ability on unseen test domains. Through a series of experiments, we find that current LLMs are capable of applying contradictory rules to solve practical math word problems and natural language reasoning tasks, implying the presence of an internal mechanism in LLMs that abstracts before reasoning.
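
As a rough illustration of the dataset construction described above, the following minimal sketch generates fine-tuning pairs under one contradictory arithmetic rule. The rule used here (`+` returns the true sum plus 2) is an assumption chosen only because it reproduces the "4+6=12" example in the Figure 1 caption; the names `overloaded_add` and `make_misleading_examples`, the prompt format, and the operand range are hypothetical and are not taken from the paper.

```python
import random


def overloaded_add(a: int, b: int, offset: int = 2) -> int:
    """Hypothetical contradictory rule: '+' yields the true sum plus an offset."""
    return a + b + offset


def make_misleading_examples(n: int, seed: int = 0, low: int = 0, high: int = 99):
    """Build n prompt/completion pairs whose answers follow the contradictory rule.

    The prompt format "a+b=" is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        a = rng.randint(low, high)
        b = rng.randint(low, high)
        examples.append(
            {"prompt": f"{a}+{b}=", "completion": str(overloaded_add(a, b))}
        )
    return examples


if __name__ == "__main__":
    for example in make_misleading_examples(3):
        print(example["prompt"] + example["completion"])
```

In a setup like this, the fine-tuning set would contain only bare expressions, while evaluation would use held-out formats such as math word problems, scored against the contradictory rule rather than against standard arithmetic.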

Paper Structure

This paper contains 19 sections, 9 figures, 8 tables.

Figures (9)

  • Figure 1: An illustration of Misleading Fine-Tuning. Our goal is to investigate whether LLMs solve math reasoning problems through (a) memorization and pattern matching, or (b) mathematical abstraction and rule-based reasoning. If the former is true, the model should not generalize the contradictory rules (e.g., "$4+6=12$") to the math word problem domain that is absent in fine-tuning. Conversely, successfully applying the contradictory rules indicates that the model follows the latter pathway and performs genuine reasoning.
  • Figure 2: Comparison between counterfactual evaluation and the proposed misleading fine-tuning (MisFT).
  • Figure 3: Results of MisFT for number overloading (top two subplots) and operator overloading (bottom two subplots). Note that the accuracy in the figure starts at $60\%$.
  • Figure 4: Results of complex operator overloading.
  • Figure 5: Results of logic overloading.
  • ...and 4 more figures