Patterns Over Principles: The Fragility of Inductive Reasoning in LLMs under Noisy Observations

Chunyang Li, Weiqi Wang, Tianshi Zheng, Yangqiu Song

TL;DR

This work investigates the robustness of inductive reasoning in large language models when observations are noisy. It introduces Robust Rule Induction and Sample-steered Rule Refinement (SRR), a three-phase strategy that combines diversity-driven hypothesis generation with execution-guided feedback to refine rules. Across arithmetic, cryptography, and list-function tasks, SRR outperforms baselines in many settings while exposing persistent instability via a consistency metric and counterfactual analyses that reveal reliance on memorized patterns. The findings underscore the gap between surface-level accuracy and genuine, human-like inductive robustness, with implications for building more reliable, rule-based AI systems; code and data are publicly available.

Abstract

Inductive reasoning, a cornerstone of human cognition, enables generalization from limited data, but large language models (LLMs) have not yet fully mastered it. While modern LLMs excel at reasoning tasks, their ability to maintain stable and consistent rule abstraction under imperfect observations remains underexplored. To fill this gap, we introduce Robust Rule Induction, a task that evaluates LLMs' capability to infer rules from data mixed with noisy examples. To address this task, we further propose Sample-steered Rule Refinement (SRR), a method that enhances reasoning stability via observation diversification and execution-guided feedback. Experiments across arithmetic, cryptography, and list functions reveal that: (1) SRR outperforms other methods with minimal performance degradation under noise; (2) despite only slight accuracy variation, LLMs exhibit instability under noise (e.g., 0% accuracy change with a consistency score of only 70%); (3) counterfactual task gaps highlight LLMs' reliance on memorized patterns over genuine abstraction. Our findings challenge LLMs' reasoning robustness, revealing susceptibility to hypothesis drift and pattern overfitting, while providing empirical evidence critical for developing human-like inductive systems. Code and data are available at https://github.com/HKUST-KnowComp/Robust-Rule-Induction.
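
The abstract's second finding (unchanged accuracy alongside a low consistency score) can look paradoxical, so a small worked example may help. The sketch below assumes that consistency is the fraction of tasks whose pass/fail outcome is the same with clean and with noisy observations; the paper's exact definition may differ. The outcome lists and the function names `accuracy` and `consistency` are hypothetical and made up for illustration, not taken from the paper's data or code.

```python
# Hypothetical pass/fail outcomes for 20 tasks (True = the induced rule
# passed the held-out tests). These values are invented for illustration.
clean = [True] * 10 + [False] * 10
noisy = [True] * 7 + [False] * 3 + [True] * 3 + [False] * 7


def accuracy(outcomes):
    """Fraction of tasks solved."""
    return sum(outcomes) / len(outcomes)


def consistency(a, b):
    """Assumed definition: fraction of tasks with the same outcome in both runs."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    print("clean accuracy:", accuracy(clean))
    print("noisy accuracy:", accuracy(noisy))
    print("consistency:   ", consistency(clean, noisy))
```

Here three tasks flip from solved to unsolved and three flip the other way, so aggregate accuracy stays at 50% while only 14 of the 20 tasks (70%) keep the same outcome. Under this assumed definition, accuracy alone would hide the instability.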

Paper Structure

This paper contains 39 sections, 1 equation, 6 figures, 12 tables, 1 algorithm.

Figures (6)

  • Figure 1: Example instances with noise and rules from Arithmetic$_{\text{base-7}}$, Cryptography$_{\text{Caesar}}$ and List Functions.
  • Figure 2: Evaluation pipeline exemplified by base-9 addition, consisting of three stages: (1) Data Synthesis, generating normal, noisy and test examples; (2) Model Inference, prompting models with seen examples to induce rules in Python function form; (3) Performance Evaluation, executing induced rules on test examples to assess correctness and robustness under noise.
  • Figure 3: Consistency score $(\%)$ with clean data of different models on the Cryptography and List Functions datasets under different noise levels.
  • Figure 4: Consistency score $(\%)$ between clean data and data with $10\%$ noise of DeepSeek-V3.
  • Figure 5: Task-solving consistency of DeepSeek-V3 and GPT-4o on List Functions. Each cell represents a task, arranged by ascending difficulty (top-to-bottom, left-to-right). Colors denote correctness patterns: 12R (all correct) to 12W (all wrong), with intermediate states (e.g., 1W11R: 1 wrong, 11 correct).
  • ...and 1 more figure
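
The three-stage pipeline described in the Figure 2 caption can be sketched roughly as follows for base-9 addition. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`to_base`, `add_in_base`, `synthesize`, `evaluate`), the noise scheme (shifting the correct answer by a random nonzero offset), and the stand-in "induced" rule are all hypothetical. The model-inference stage is replaced by a hand-written function, since no LLM is called here.

```python
import random


def to_base(n, base):
    """Write a non-negative integer as a numeral string in `base` (2..10)."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        digits.append(str(n % base))
        n //= base
    return "".join(reversed(digits))


def add_in_base(a, b, base=9):
    """Ground-truth rule: add two numerals written in `base`."""
    return to_base(int(a, base) + int(b, base), base)


def synthesize(n, base=9, noise_ratio=0.0, seed=0):
    """Stage 1 (assumed scheme): generate examples, then corrupt a fraction."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        # Two-digit operands in the target base.
        a = to_base(rng.randrange(base, base * base), base)
        b = to_base(rng.randrange(base, base * base), base)
        examples.append(((a, b), add_in_base(a, b, base)))
    for i in rng.sample(range(n), round(n * noise_ratio)):
        (a, b), y = examples[i]
        # A nonzero offset guarantees the noisy label differs from the truth.
        wrong = to_base(int(y, base) + rng.randrange(1, base), base)
        examples[i] = ((a, b), wrong)
    return examples


def induced_rule_decimal(a, b):
    """Stage 2 stand-in: a rule that falls back on ordinary decimal addition."""
    return str(int(a) + int(b))


def evaluate(rule, tests):
    """Stage 3: execute a rule on test examples and return its accuracy."""
    return sum(rule(a, b) == y for (a, b), y in tests) / len(tests)


if __name__ == "__main__":
    seen = synthesize(10, noise_ratio=0.1, seed=1)   # shown to the model
    tests = synthesize(20, seed=2)                   # clean held-out examples
    print("ground truth:", evaluate(add_in_base, tests))
    print("decimal rule:", evaluate(induced_rule_decimal, tests))
```

The decimal stand-in agrees with the base-9 rule only when no carry occurs. For instance, `add_in_base("18", "1")` gives `"20"` while `induced_rule_decimal("18", "1")` gives `"19"`. This is the kind of gap that executing induced rules on test examples is meant to expose.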