MIRAGE: Evaluating and Explaining Inductive Reasoning Process in Language Models

Jiachun Li, Pengfei Cao, Zhuoran Jin, Yubo Chen, Kang Liu, Jun Zhao

TL;DR

MIRAGE introduces a synthetic, flexible dataset to evaluate both the inductive and deductive stages of reasoning in large language models, addressing prior work's limited scope and data rigidity. It reveals that LLMs are not reliable rule-based inductive reasoners; instead, they leverage neighbor-based cues from observed facts, performing strong deductions within localized regions of the input space. The dataset construction uses a meta-rule library with five atomic operations to generate facts, varying the dimension $D$ and the observation count $N$, and transforms these facts into four task scenarios: list transformation (LT), real-world problem (RP), code generation (CG), and string transformation (ST). Across multiple models and prompting regimes, MIRAGE shows limited gains from advanced reasoning prompts and highlights the importance of neighborhood similarity and form-related transfer in inductive reasoning. These findings have implications for designing evaluation protocols and guiding future methods to bolster robust inductive capabilities in LLMs.

Abstract

Inductive reasoning is an essential capability for large language models (LLMs) to achieve higher intelligence, which requires the model to generalize rules from observed facts and then apply them to unseen examples. We present MIRAGE, a synthetic dataset that addresses the limitations of previous work, specifically the lack of comprehensive evaluation and flexible test data. In it, we evaluate LLMs' capabilities in both the inductive and deductive stages, allowing for flexible variation in input distribution, task scenario, and task difficulty to analyze the factors influencing LLMs' inductive reasoning. Based on these multi-faceted evaluations, we demonstrate that LLMs are poor rule-based reasoners. In many cases, when conducting inductive reasoning, they do not rely on a correct rule to answer the unseen case. Across different prompting methods, observation numbers, and task forms, models consistently tend to perform correct deduction without having induced a correct rule. In addition, we find that LLMs are good neighbor-based reasoners. In the inductive reasoning process, the model tends to focus on observed facts that are close to the current test example in feature space. By leveraging these similar examples, the model maintains strong inductive capabilities within a localized region, significantly improving its deductive performance.

Paper Structure

This paper contains 50 sections, 5 theorems, 69 equations, 10 figures, 16 tables.

Key Result

Theorem 1

Let $\mathbf{A} = (a_1, a_2, \ldots, a_n) \in \mathbb{R}^n$. Define a mapping $f: \mathbb{R}^n \to \mathbb{R}^n$ such that for a fixed index $k \in \{1, 2, \ldots, n\}$ and a fixed subset $I \subseteq \{1, 2, \ldots, n\}$, we have $f(\mathbf{A}) = (a_1, \ldots, a_{k-1}, \sum_{i \in I} a_i, a_{k+1}, \ldots, a_n)$, where $k \notin I$. Then $f$ is a continuous function.
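As a rough illustration of what Theorem 1 claims, the sketch below reads the Add operation as overwriting coordinate $k$ with the sum of the coordinates indexed by $I$ and leaving every other coordinate unchanged. That reading is an interpretation of the theorem statement, and the names `add_op` and `max_abs_diff` are hypothetical helpers, not part of MIRAGE. Under this reading the map is linear, so a small change in the input produces a proportionally small change in the output.

```python
from typing import Iterable, List


def add_op(a: List[float], k: int, index_set: Iterable[int]) -> List[float]:
    """Return a copy of `a` whose k-th entry is the sum of a[i] for i in index_set.

    Indices are 0-based here, whereas the theorem uses 1-based indices.
    Assumed form of the Add operation; not taken from the MIRAGE code.
    """
    idx = set(index_set)
    if k in idx:
        raise ValueError("k must not belong to the index set")
    out = list(a)
    out[k] = sum(a[i] for i in idx)
    return out


def max_abs_diff(x: List[float], y: List[float]) -> float:
    """Largest coordinate-wise absolute difference (max-norm distance)."""
    return max(abs(p - q) for p, q in zip(x, y))


a = [1.0, 2.0, 3.0]
b = [1.0, 2.001, 2.999]  # a slightly perturbed copy of a

fa = add_op(a, 0, [1, 2])
fb = add_op(b, 0, [1, 2])

# In the max norm, the output moves by at most max(1, |I|) times the input change.
input_change = max_abs_diff(a, b)
output_change = max_abs_diff(fa, fb)
print(fa, fb, input_change, output_change)
```

The bound `max(1, |I|)` in the comment follows from the triangle inequality applied to the summed coordinate; it is a numerical check on one example, not a replacement for the proof.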

Figures (10)

  • Figure 1: An overview of two paradigms (i.e. rule-based and neighbor-based) in inductive reasoning.
  • Figure 2: Examples in four different scenarios of MIRAGE.
  • Figure 2: Comparison of CR on two tasks ($D$ = 3, $N$ = 3). BF and AF indicate the accuracy before and after perturbation.
  • Figure 3: The distribution of ICT and DCT for the examples across different models.
  • Figure 4: Performance on EI tasks under different scenarios of observed and test facts.
  • ...and 5 more figures

Theorems & Definitions (10)

  • Theorem 1: Add Operation Continuity
  • proof
  • Theorem 2: Copy Operation Continuity
  • proof
  • Theorem 3: Map Operation Continuity
  • proof
  • Theorem 4: Pad Operation Continuity
  • proof
  • Theorem 5: Swap Operation Continuity
  • proof