Thinking Preference Optimization

Wang Yang, Hongye Jin, Jingfeng Yang, Vipin Chaudhary, Xiaotian Han

TL;DR

Thinking Preference Optimization (ThinkPO) addresses the late-stage bottleneck of improving reasoning in SFT-ed LLMs by reusing existing long CoT data. It employs Direct Preference Optimization with long CoT as chosen and short CoT as rejected samples, enabling the model to favor longer, more structured reasoning without collecting new long CoT responses. Across multiple math benchmarks and model sizes, ThinkPO delivers consistent gains in accuracy and output length, including a MATH500 improvement from 87.4% to 91.2% for the publicly distilled DeepSeek-R1-Distill-Qwen-7B and an overall 8.6% accuracy boost for SFT-ed models. This data-efficient post-SFT refinement thus enhances reasoning capabilities with minimal additional resources and is validated on diverse datasets and open-source models.

Abstract

Supervised Fine-Tuning (SFT) has been a go-to and effective method for enhancing long chain-of-thought (CoT) reasoning in relatively small LLMs by fine-tuning them with long CoT responses from larger LLMs. To continually improve reasoning abilities, we can either collect new high-quality long CoT reasoning SFT data or repeatedly train on existing SFT datasets. However, acquiring new long CoT SFT data is costly and limited, while repeated training often results in a performance plateau or decline. To further boost the performance with the SFT data, we propose Thinking Preference Optimization (ThinkPO), a simple yet effective post-SFT method that enhances long CoT reasoning without requiring new long CoT responses. Instead, ThinkPO utilizes readily available or easily obtainable short CoT reasoning responses as rejected answers and long CoT responses as chosen answers for the same question. It then applies direct preference optimization to encourage the model to favor longer reasoning outputs. Experiments show that ThinkPO further improves the reasoning performance of SFT-ed models, e.g., it increases math reasoning accuracy of SFT-ed models by 8.6% and output length by 25.9%. Notably, ThinkPO is capable of continually boosting the performance of the publicly distilled SFT model, e.g., increasing the official DeepSeek-R1-Distill-Qwen-7B's performance on MATH500 from 87.4% to 91.2%.
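
To make the preference step concrete, the following is a minimal sketch of the standard DPO loss for a single preference pair, with the long CoT response in the chosen role and the short CoT response in the rejected role. It is not the authors' training code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, and the inputs are assumed to be sequence-level log-probabilities (summed over tokens) under the model being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    chosen   = long CoT answer, rejected = short CoT answer.
    beta=0.1 is an assumed placeholder, not a value taken from the paper.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Because each log-probability is measured relative to the reference model, the much lower raw log-probability of a long answer does not penalize it by itself. The loss falls only when the policy raises the long answer's likelihood relative to the reference more than it raises the short answer's.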

Paper Structure

This paper contains 20 sections, 7 figures, 9 tables.

Figures (7)

  • Figure 1: The illustration of our method ThinkPO and its performance on math reasoning tasks. Top: Our ThinkPO enhances fine-tuned LLMs (+SFT) by promoting detailed problem-solving, using long chain-of-thought reasoning answers as positive (chosen) samples and short chain-of-thought reasoning answers as negative (rejected) samples. Bottom Left: ThinkPO significantly boosts performance across mathematical benchmarks (e.g., 83.4% on MATH500 vs. 82.8% for +SFT and 74.0% for the Base model). Bottom Right: ThinkPO generates more detailed solutions, with average completion lengths on AIME increasing from 0.94K to 21.57K to 23.9K tokens. These results underscore Thinking Preference Optimization's effectiveness in fostering and enhancing advanced mathematical reasoning.
  • Figure 2: Analysis of accuracy (Left), average response length (Middle), and count of reasoning-supportive words (Right; e.g., wait, hmm) during the SFT and ThinkPO processes. We evaluate the model on MATH500 every 300 steps and record all three metrics. In the early training stages, all of them improve significantly. However, in the later stages (e.g., after 1200 steps), the model's performance gradually plateaus. When ThinkPO is applied, we see additional improvements in all three aspects, demonstrating the effectiveness of Thinking Preference Optimization.
  • Figure 3: Data Collection Process: we use DeepSeek-R1 to generate long reasoning answers as chosen samples and Qwen 2.5-7B-Math to generate short reasoning answers as rejected samples, collecting datasets for DPO training. Compared with short reasoning data, long reasoning answers include many reasoning-supportive discourse markers, such as wait, hmm, and other hesitation cues, which can improve the model's reasoning ability.
  • Figure 4: Visualization of improvements in accuracy and average response length of DeepSeek-R1-Distill-Qwen-7B (Left) and our fine-tuned Qwen2.5-7B-Instruct (Right) on four datasets after ThinkPO. ThinkPO improves the accuracy and output length of both models across almost all the datasets.
  • Figure 6: Visualization of improvements in accuracy and average response length of models from the same family at different sizes (Qwen-2.5-3B, Qwen-2.5-7B, and Qwen-2.5-14B) on five datasets after ThinkPO. ThinkPO improves the models' accuracy and output lengths across almost all the datasets, regardless of size.
  • ...and 2 more figures
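
The data collection described in the Figure 3 caption, together with the reasoning-supportive word count tracked in Figure 2, can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `build_preference_pairs` and `count_reasoning_words`, the `prompt`/`chosen`/`rejected` field names, and the two-word marker list (only "wait" and "hmm" are named in the captions) are hypothetical choices, and the long and short answers are assumed to have been generated already and stored as question-to-answer dictionaries.

```python
import re

# Only "wait" and "hmm" are named in the captions; any longer list is an assumption.
REASONING_WORDS = ("wait", "hmm")


def count_reasoning_words(text, words=REASONING_WORDS):
    """Count whole-word, case-insensitive occurrences of the marker words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(tokens.count(word) for word in words)


def build_preference_pairs(long_answers, short_answers):
    """Pair the long (chosen) and short (rejected) answer for each shared question.

    Both arguments map a question string to an answer string. Questions that
    lack a short answer are skipped, since a pair needs both sides.
    """
    pairs = []
    for question, long_cot in long_answers.items():
        short_cot = short_answers.get(question)
        if short_cot is None:
            continue
        pairs.append({"prompt": question, "chosen": long_cot, "rejected": short_cot})
    return pairs
```

Each resulting record holds one question with its long answer as the chosen sample and its short answer as the rejected sample, which is the shape a DPO training step consumes.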