Code-Driven Inductive Synthesis: Enhancing Reasoning Abilities of Large Language Models with Sequences

Kedi Chen, Zhikai Lei, Fan Zhang, Yinqi Zhang, Qin Chen, Jie Zhou, Liang He, Qipeng Guo, Kai Chen, Wei Zhang

TL;DR

This work tackles the underexplored area of inductive reasoning in large language models by addressing the data bottleneck with CodeSeq, a synthetic dataset built from number sequences. It treats finding the general term $a_n$ as a code problem and injects case-based supervision through code unit tests within a three-stage synthetic data pipeline. Finetuning LLMs with CodeSeq yields improvements on code-generation benchmarks and strong transfer to broad reasoning tasks, indicating that inductive reasoning data can meaningfully enhance reasoning abilities. The approach demonstrates the practical potential of harnessing sequence-based inductive tasks to boost generalization and reasoning in LLMs.

Abstract

Large language models have made remarkable progress in reasoning capabilities. Existing work focuses mainly on deductive reasoning tasks (e.g., code and math), while inductive reasoning, a mode that aligns more closely with human learning, is not well studied. We attribute this gap to the difficulty of obtaining high-quality process supervision data for inductive reasoning. To this end, we employ number sequences as a source of inductive reasoning data. We package sequences into algorithmic problems whose goal is to find the general term of each sequence through a code solution. This lets us verify whether the code solution holds for any term in the current sequence and inject case-based supervision signals through code unit tests. We build a synthetic data pipeline for sequences and use it to form a training dataset, CodeSeq. Experimental results show that models tuned with CodeSeq improve on both code and comprehensive reasoning benchmarks.
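
The idea of treating a general term as a code solution checked by unit tests can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `general_term` and `failing_cases`, the 1-based offset, and the choice of the triangular numbers as the example sequence are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def general_term(n: int) -> int:
    """Hypothetical candidate solution: the n-th triangular number (1-indexed)."""
    return n * (n + 1) // 2


def failing_cases(
    candidate: Callable[[int], int],
    known_terms: Sequence[int],
    offset: int = 1,
) -> List[Tuple[int, int, int]]:
    """Check the candidate against every known term of the sequence.

    Returns one (index, expected, actual) triple per mismatch, so each
    failing case can serve as a concrete feedback signal.
    """
    failures = []
    for i, expected in enumerate(known_terms):
        n = offset + i
        actual = candidate(n)
        if actual != expected:
            failures.append((n, expected, actual))
    return failures


# Known terms of the example sequence (triangular numbers), starting at n = 1.
known = [1, 3, 6, 10, 15, 21]

# A correct general term passes every case.
print(failing_cases(general_term, known))

# A wrong guess that fits only the first two terms is exposed by later cases.
print(failing_cases(lambda n: 2 * n - 1, known))
```

A candidate that matches only a prefix of the sequence still fails on later terms, which is what makes the per-case results usable as supervision rather than a single pass/fail flag.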

Paper Structure

This paper contains 38 sections, 14 figures, 4 tables.

Figures (14)

  • Figure 1: We select 200 sequences and prompt three powerful models for next number prediction (more details in the appendix on next number prediction). The results demonstrate that existing LLMs perform poorly in inductive reasoning, indicating significant research potential in this area.
  • Figure 2: The sequence synthetic data pipeline consists of three steps, which together produce CodeSeq.
  • Figure 3: We carry out next number prediction with LLaMA3-8B and Qwen2.5-7B before and after training to test their inductive reasoning abilities.
  • Figure 4: The prompt for the next number prediction task.
  • Figure 5: An example OEIS webpage. The page includes the sequence, its offsets, references, links to supplementary information, worked examples, mathematical explanations, relationships to other sequences, and so on.
  • ...and 9 more figures