Select2Reason: Efficient Instruction-Tuning Data Selection for Long-CoT Reasoning

Cehao Yang, Xueyuan Lin, Xiaojun Wu, Chengjin Xu, Xuhui Jiang, Honghao Liu, Hui Xiong, Jian Guo

TL;DR

Select2Reason addresses the data inefficiency of long-CoT instruction tuning by automatically selecting a small, high-utility subset from large instruction pools. It jointly leverages a rollout-derived difficulty signal and a normalized reasoning-trace length via a weighted joint ranker, enabling high-quality SFT with only a fraction of data. Empirical results on OpenR1-Math-220k across nine math benchmarks show that using about 10% of data selected by Select2Reason matches or exceeds full-data tuning and outperforms several baselines, with strong generalization to other pools and model scales and substantial training-time savings. The method offers a practical, scalable path to activating long-CoT reasoning in diverse domains while reducing computational cost, making high-quality long-CoT capabilities more accessible for real-world applications.

Abstract

A practical approach to activating the long chain-of-thought (long-CoT) reasoning ability of pre-trained large language models is to perform supervised fine-tuning on instruction datasets synthesized by strong Large Reasoning Models such as DeepSeek-R1, offering a cost-effective alternative to reinforcement learning. However, large-scale instruction sets with more than 100k samples incur significant training overhead, and effective strategies for automatic long-CoT instruction selection remain unexplored. In this work, we propose Select2Reason, a novel and efficient instruction-tuning data selection framework for long-CoT reasoning. From the perspective of the emergence of rethinking behaviors such as self-correction and backtracking, we investigate common metrics that may determine the quality of long-CoT reasoning instructions. Select2Reason uses a quantifier to estimate question difficulty and combines it with a heuristic based on reasoning-trace length through a weighted ranking scheme that prioritizes high-utility examples. Empirical results on OpenR1-Math-220k demonstrate that fine-tuning an LLM on only 10% of the data selected by Select2Reason achieves performance competitive with or superior to full-data tuning and the open-source baseline OpenR1-Qwen-7B across three competition-level and six comprehensive mathematical benchmarks. Further experiments highlight its scalability across data sizes, its efficiency during inference, and its adaptability to other instruction pools at minimal cost.
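
The weighted ranking described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the scoring formula, so the linear combination, the min-max normalization of trace length, the default `weight=0.5`, and the names `Example`, `min_max`, and `joint_rank` are all assumptions made for illustration. The difficulty score is assumed to be already computed and to lie in [0, 1].

```python
from dataclasses import dataclass


@dataclass
class Example:
    question: str
    trace_tokens: int   # length of the reasoning trace, in tokens
    difficulty: float   # assumed score in [0, 1]; higher means harder


def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def joint_rank(pool, weight=0.5, fraction=0.10):
    """Return the top `fraction` of `pool` by a weighted difficulty/length score.

    Assumed score: weight * difficulty + (1 - weight) * normalized_length.
    """
    if not pool:
        return []
    norm_len = min_max([ex.trace_tokens for ex in pool])
    scores = [
        weight * ex.difficulty + (1.0 - weight) * nl
        for ex, nl in zip(pool, norm_len)
    ]
    # Sort indices by score, highest first, and keep at least one example.
    order = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    k = max(1, round(fraction * len(pool)))
    return [pool[i] for i in order[:k]]
```

Setting `weight=1.0` ranks by difficulty alone and `weight=0.0` by trace length alone, which makes it easy to compare the joint ranker against either single-metric heuristic on the same pool.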

Paper Structure

This paper contains 34 sections, 7 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Statistics of rethinking tokens in reasoning traces. Longer reasoning traces exhibit a higher per-step frequency of rethinking tokens such as Wait, Alternatively, Maybe, and However; these tokens also occur often in instructions whose questions are hard to solve.
  • Figure 2: A brief overview of the Select2Reason pipeline. Given a large-scale instruction pool, we select the data that maximize the learning value of the subset by controlling problem difficulty and reasoning-trace length, two criteria motivated by the frequency of rethinking tokens during reasoning. The long-CoT reasoning ability of the downstream model is activated by low-cost supervised fine-tuning on the instruction subset.
  • Figure 3: Performance on three expert-level mathematical benchmarks using instruction subsets selected by reasoning-trace length: the longest, the shortest, and the middle.
  • Figure 4: Comparison of instructions with varying reasoning-trace lengths. Long reasoning trajectories incorporate more human-like behaviors such as reflection, backtracking, and planning, which serve as higher-quality supervision signals during fine-tuning. In contrast, short traces often omit substantive decision-making steps and explicitly bypass reasoning by using empty constructs like <think>\n</think>, rendering them ineffective.
  • Figure 5: Pass@1 across six math benchmarks. Examples that are easy or hard for the base model are selected for training; the hard examples bring more learning value.
  • ...and 8 more figures
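
The per-step rethinking-token statistic mentioned in the Figure 1 caption could be computed along the following lines. This is a rough sketch under stated assumptions, not the paper's procedure: the token list comes from the caption, while splitting steps on blank lines, case-sensitive whole-word matching, and the name `rethinking_rate` are hypothetical choices.

```python
import re

# Rethinking tokens named in the Figure 1 caption.
RETHINKING_TOKENS = ("Wait", "Alternatively", "Maybe", "However")
_PATTERN = re.compile(r"\b(?:" + "|".join(RETHINKING_TOKENS) + r")\b")


def rethinking_rate(trace, step_sep="\n\n"):
    """Average number of rethinking tokens per reasoning step.

    Assumption: steps are separated by `step_sep` (a blank line by default).
    """
    steps = [s for s in trace.split(step_sep) if s.strip()]
    if not steps:
        return 0.0
    hits = sum(len(_PATTERN.findall(step)) for step in steps)
    return hits / len(steps)
```

For example, a trace with three steps that contains one `Wait`, one `Maybe`, and one `However` yields a rate of 1.0, while a trace with no such tokens yields 0.0.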