Select2Reason: Efficient Instruction-Tuning Data Selection for Long-CoT Reasoning
Cehao Yang, Xueyuan Lin, Xiaojun Wu, Chengjin Xu, Xuhui Jiang, Honghao Liu, Hui Xiong, Jian Guo
TL;DR
Select2Reason addresses the data inefficiency of long-CoT instruction tuning by automatically selecting a small, high-utility subset from large instruction pools. It combines a rollout-derived difficulty signal with a normalized reasoning-trace length in a weighted joint ranker, enabling high-quality SFT with only a fraction of the data. Empirical results on OpenR1-Math-220k across nine math benchmarks show that tuning on about 10% of the data selected by Select2Reason matches or exceeds full-data tuning and outperforms several baselines, with strong generalization to other pools and model scales and substantial training-time savings. The method offers a practical, scalable path to activating long-CoT reasoning in diverse domains while reducing computational cost, making high-quality long-CoT capabilities more accessible for real-world applications.
Abstract
A practical approach to activating long chain-of-thought (long-CoT) reasoning in pre-trained large language models (LLMs) is supervised fine-tuning on instruction datasets synthesized by strong Large Reasoning Models such as DeepSeek-R1, which offers a cost-effective alternative to reinforcement learning. However, large-scale instruction sets with more than 100k samples incur significant training overhead, and effective strategies for automatic long-CoT instruction selection remain unexplored. In this work, we propose Select2Reason, a novel and efficient instruction-tuning data selection framework for long-CoT reasoning. From the perspective of the emergence of rethinking behaviors such as self-correction and backtracking, we investigate common metrics that may determine the quality of long-CoT reasoning instructions. Select2Reason uses a quantifier to estimate question difficulty and combines it with a heuristic based on reasoning-trace length through a weighted ranking scheme that prioritizes high-utility examples. Empirical results on OpenR1-Math-220k demonstrate that fine-tuning an LLM on only 10% of the data selected by Select2Reason achieves performance competitive with or superior to full-data tuning and the open-source baseline OpenR1-Qwen-7B across three competition-level and six comprehensive mathematical benchmarks. Further experiments highlight its scalability across data sizes, its efficiency during inference, and its adaptability to other instruction pools at minimal cost.
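The weighted ranking idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Example` class, the `select_top_fraction` function, the min-max normalization of trace length, the linear weighted sum, and the default `weight` value are all assumptions made for illustration. The difficulty score is taken as given (assumed to lie in [0, 1]); how the quantifier produces it is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """Hypothetical container for one instruction in the pool."""
    question: str
    difficulty: float   # assumed precomputed, in [0, 1]; higher means harder
    trace_tokens: int   # length of the reasoning trace, e.g. in tokens


def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_top_fraction(pool, weight=0.5, fraction=0.1):
    """Rank by a weighted sum of difficulty and normalized trace length,
    then keep the top `fraction` of the pool (at least one example).

    `weight` is the share given to difficulty; (1 - weight) goes to length.
    Both the linear form and the default value are illustrative assumptions.
    """
    if not pool:
        return []
    norm_len = min_max([ex.trace_tokens for ex in pool])
    scored = [
        (weight * ex.difficulty + (1.0 - weight) * n, i)
        for i, (ex, n) in enumerate(zip(pool, norm_len))
    ]
    # Highest score first; ties broken by original position for determinism.
    scored.sort(key=lambda t: (-t[0], t[1]))
    k = max(1, round(fraction * len(pool)))
    return [pool[i] for _, i in scored[:k]]
```

Calling `select_top_fraction(pool, weight=0.5, fraction=0.1)` on a pool of instructions would return roughly the top 10% under this illustrative score. Setting `weight` to 1.0 or 0.0 reduces the sketch to ranking by difficulty alone or by trace length alone.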
