Expanding Reasoning Potential in Foundation Model by Learning Diverse Chains of Thought Patterns

Xuemiao Zhang, Can Ren, Chengying Tu, Rongxiang Weng, Shuo Wang, Hongfei Yan, Jingang Wang, Xunliang Cai

TL;DR

The paper introduces CoTP, a data-efficient framework for expanding the reasoning ability of foundation models. It mines high-value CoT reasoning patterns and token entropy, assembles a core set of reasoning patterns, and selects training data with a dual-granularity matching algorithm. The authors define the model potential Φ as the probability of sampling a correct answer, show that it equals the inverse of the expected number of attempts, and formalize data selection as approximating an ideal oracle dataset. Using a two-tiered core set (pattern chains and entropy chains) and a weighted DTW assignment procedure, CoTP curates long-CoT data that aligns with the core set. With 10B tokens of CoTP data, the 85A6B MoE model improves by 9.58% on AIME 2024 and 2025, and the upper bound of downstream RL performance rises by 7.81%. The approach also scales with data volume while maintaining general performance, and the analysis offers insight into why certain reasoning patterns support generalization and introspective capabilities across STEM domains.

Abstract

Recent progress in large reasoning models for challenging mathematical reasoning has been driven by reinforcement learning (RL). Incorporating long chain-of-thought (CoT) data during mid-training has also been shown to substantially improve reasoning depth. However, current approaches often utilize CoT data indiscriminately, leaving open the critical question of which data types most effectively enhance model reasoning capabilities. In this paper, we define the foundation model's reasoning potential for the first time as the inverse of the number of independent attempts required to correctly answer the question, which is strongly correlated with the final model performance. We then propose utilizing diverse data enriched with high-value reasoning patterns to expand the reasoning potential. Specifically, we abstract atomic reasoning patterns from CoT sequences, characterized by commonality and inductive capabilities, and use them to construct a core reference set enriched with valuable reasoning patterns. Furthermore, we propose a dual-granularity algorithm involving chains of reasoning patterns and token entropy, efficiently selecting high-value CoT data (CoTP) from the data pool that aligns with the core set, thereby training models to master reasoning effectively. Only 10B-token CoTP data enables the 85A6B Mixture-of-Experts (MoE) model to improve by 9.58% on the challenging AIME 2024 and 2025, and to raise the upper bound of downstream RL performance by 7.81%.
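The dual-granularity matching described above can be illustrated with a small sketch. It computes a plain dynamic time warping (DTW) distance between a candidate and a reference chain at two levels, pattern labels and token entropies, and mixes the two. This is a sketch under stated assumptions, not the paper's algorithm: the cost functions, the mixing coefficient `alpha`, the dictionary layout, and the pattern names are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def dtw_distance(a: Sequence[T], b: Sequence[T],
                 cost: Callable[[T, T], float]) -> float:
    """Classic DTW: minimal cumulative cost of aligning a with b."""
    n, m = len(a), len(b)
    inf = float("inf")
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(a[i - 1], b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            dp[i][j] = step + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[n][m]


def pattern_cost(p: str, q: str) -> float:
    # Assumption: 0/1 mismatch cost between pattern labels.
    return 0.0 if p == q else 1.0


def entropy_cost(x: float, y: float) -> float:
    # Assumption: absolute difference between entropy values.
    return abs(x - y)


def chain_distance(candidate: dict, reference: dict, alpha: float = 0.5) -> float:
    """Mix pattern-level and entropy-level DTW distances (alpha is hypothetical)."""
    pat = dtw_distance(candidate["patterns"], reference["patterns"], pattern_cost)
    ent = dtw_distance(candidate["entropies"], reference["entropies"], entropy_cost)
    return alpha * pat + (1.0 - alpha) * ent


# Hypothetical core-set entry and two candidates from a data pool.
reference = {"patterns": ["decompose", "verify"], "entropies": [0.2, 0.9]}
pool = {
    "cand_a": {"patterns": ["decompose", "decompose", "verify"],
               "entropies": [0.2, 0.2, 0.9]},
    "cand_b": {"patterns": ["guess", "verify"], "entropies": [0.5, 0.9]},
}

# Select the candidate whose chains lie closest to the reference.
best = min(pool, key=lambda name: chain_distance(pool[name], reference))
print(best)
```

Because DTW allows one element to align with several, `cand_a` matches the reference despite its repeated step, which is the property that makes DTW a natural fit for chains of unequal length.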

Paper Structure

This paper contains 52 sections, 2 theorems, 13 equations, 13 figures, 12 tables, 3 algorithms.

Key Result

Corollary 1

Let $K_i$ denote the first-passage time for question $q_i$, that is, the number of independent attempts required to solve $q_i$. Suppose each attempt is an independent Bernoulli trial with success probability $\Phi(\mathcal{M}, q_i)$, so that $K_i \sim \mathrm{Geom}(\Phi(\mathcal{M}, q_i))$. Then $\mathbb{E}[K_i] = 1 / \Phi(\mathcal{M}, q_i)$. In other words, the model potential is the inverse of the expected first-passage time and a smaller …
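The geometric-distribution relationship in Corollary 1 can be checked numerically. The following is a minimal sketch that simulates independent Bernoulli attempts and compares the inverse of the mean first-passage time with the success probability. The value `phi = 0.25`, the sample size, and the function names are assumptions made for illustration and do not come from the paper.

```python
import random


def first_passage_time(phi: float, rng: random.Random) -> int:
    """Count independent Bernoulli(phi) attempts up to and including the first success."""
    attempts = 1
    while rng.random() >= phi:  # an attempt fails with probability 1 - phi
        attempts += 1
    return attempts


def estimate_potential(phi: float, n_trials: int = 100_000, seed: int = 0) -> float:
    """Estimate 1 / E[K] from simulated first-passage times."""
    rng = random.Random(seed)
    mean_k = sum(first_passage_time(phi, rng) for _ in range(n_trials)) / n_trials
    return 1.0 / mean_k


phi = 0.25  # hypothetical per-attempt success probability
estimate = estimate_potential(phi)
print(f"phi = {phi}, estimated 1/E[K] = {estimate:.3f}")
```

With enough simulated questions the estimate should land close to `phi`, which matches the reading of model potential as the inverse of the expected number of attempts.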

Figures (13)

  • Figure 1: Illustration of the CoTP framework. The left figure shows the overall process of the CoTP framework, while the right figure shows the graph of reasoning chains. The patterns with higher TF-IDF weights are important, while the remaining patterns are considered normal. The CoTP framework selects the minimum distance chain from the source data pool.
  • Figure 2: The comparison of pass@k and RL performance across different datasets.
  • Figure 3: Scalability of data volume examining the SFT performance of models mid-trained on datasets of varying volumes. The dashed lines represent the performance of each dataset configured under the 30B-token setting.
  • Figure 4: Illustration of DTW alignment analysis on the pattern chain similarity matrix.
  • Figure 5: Examples of reasoning patterns with different levels of importance.
  • ...and 8 more figures

Theorems & Definitions (4)

  • Definition 1: Model Potential
  • Corollary 1
  • Definition 2: Reasoning Pattern and Pattern Chain
  • Theorem 2