Leveraging Reasoning Model Answers to Enhance Non-Reasoning Model Capability

Haotian Wang, Han Zhao, Shuaiting Chen, Xiaoyu Tian, Sitong Zhao, Yunjie Ji, Yiping Peng, Xiangang Li

TL;DR

This work investigates transferring the high-quality reasoning outputs of large reasoning-focused models to cheaper non-reasoning models via Supervised Fine-Tuning (SFT). The authors construct a large, diverse SFT dataset and evaluate several strategies for using reasoning traces, ranging from training on the final answer alone to concatenating summarized reasoning with the answer. Knowledge distilled from reasoning models improves performance on several benchmarks (e.g., GSM8K and HumanEval) but may impair conversational alignment in some setups. A key finding is that the way reasoning information is structured during fine-tuning critically influences transfer effectiveness: think-summarization offers a balanced advantage across tasks, while training on final answers alone can overfit to non-conversational tasks. The results suggest practical pathways for knowledge distillation and call for future work on more integrated representations of reasoning traces and prompt-based elicitation of reasoning within final responses.

Abstract

Recent advancements in large language models (LLMs), such as DeepSeek-R1 and OpenAI-o1, have demonstrated the significant effectiveness of test-time scaling, achieving substantial performance gains across various benchmarks. These advanced models utilize deliberate "thinking" steps to systematically enhance answer quality. In this paper, we propose leveraging the high-quality outputs generated by these reasoning-intensive models to improve less computationally demanding, non-reasoning models. We explore and compare methodologies for utilizing the answers produced by reasoning models to train and improve non-reasoning models. Through straightforward Supervised Fine-Tuning (SFT) experiments, we demonstrate consistent improvements across established benchmarks, underscoring the potential of this approach for advancing the ability of models to answer questions directly.

Paper Structure

This paper contains 13 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Benchmark performance of different answer utilization methods
  • Figure 2: Category distribution of the dataset
  • Figure 3: Prompt used in inference