Don't Think Longer, Think Wisely: Optimizing Thinking Dynamics for Large Reasoning Models

Sohyun An, Ruochen Wang, Tianyi Zhou, Cho-Jui Hsieh

TL;DR

Large Reasoning Models trained with outcome-based RL often overthink, yielding long and costly reasoning traces. The authors propose Dynamic Thinking Pattern Optimization (DTO), a framework that segments a reasoning trace into thinking patterns, prunes the unhelpful ones, and reinforces the beneficial ones to produce concise, effective reasoning trajectories. They then apply pairwise preference optimization (SimPO) so that the model favors the efficient trajectories. The optimized trajectories cut attention FLOPs by up to 47%, and the trained model improves accuracy by up to 12% on mathematical reasoning benchmarks while token usage drops from roughly 5,000 to 3,000 tokens. The method generalizes beyond mathematics to diverse domains (e.g., MMLU-Pro) and demonstrates practical potential for cheaper, reliable reasoning in Large Reasoning Models.

Abstract

While the recent success of large reasoning models (LRMs) has significantly advanced LLMs' reasoning capability by optimizing final answer accuracy with reinforcement learning, these models may also drastically increase output length due to overthinking, characterized by unnecessarily complex reasoning paths that waste computation and potentially degrade performance. We hypothesize that such inefficiencies stem from LRMs' limited capability to dynamically select the proper modular reasoning strategies, termed thinking patterns, at the right position. To investigate this hypothesis, we propose a dynamic optimization framework that segments model-generated reasoning paths into distinct thinking patterns, systematically identifying and promoting beneficial patterns that improve the answer while removing detrimental ones. Empirical analysis confirms that our optimized thinking paths yield more concise yet sufficiently informative trajectories, enhancing reasoning efficiency by reducing attention FLOPs by up to 47% while maintaining accuracy for originally correct responses. Moreover, a non-trivial portion of originally incorrect responses are transformed into correct ones, achieving a 15.6% accuracy improvement with reduced length. Motivated by the improvement brought by the optimized thinking paths, we apply a preference optimization technique supported by a pairwise dataset contrasting suboptimal and optimal reasoning paths. Experimental evaluations across multiple mathematical reasoning benchmarks reveal that our method notably reduces computational overhead while simultaneously improving reasoning accuracy, achieving up to a 12% accuracy improvement and reducing token usage from approximately 5,000 to 3,000 tokens.
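The preference optimization step can be illustrated with the standard SimPO objective, which the TL;DR names as the technique used. The sketch below is a minimal, framework-free version for a single pair and is not the authors' training code: the function names are hypothetical, and the values of `beta` and `gamma` are illustrative defaults rather than the paper's settings. It assumes the "chosen" response is the optimized (shorter) trajectory and the "rejected" response is the original one.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def simpo_loss(
    logp_chosen: float,
    len_chosen: int,
    logp_rejected: float,
    len_rejected: int,
    beta: float = 2.0,   # illustrative value, not taken from the paper
    gamma: float = 1.0,  # illustrative target margin, not taken from the paper
) -> float:
    """SimPO loss for one preference pair.

    logp_* are summed token log-probabilities of each response under the
    policy; len_* are the response lengths in tokens. SimPO uses the
    length-normalized log-probability as an implicit reward.
    """
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    margin = reward_chosen - reward_rejected - gamma
    # -log(sigmoid(margin)) equals softplus(-margin)
    return softplus(-margin)


# Hypothetical pair: a 3,000-token optimized trajectory preferred over a
# 5,000-token original one (log-probabilities are made up for illustration).
loss = simpo_loss(-300.0, 3000, -1000.0, 5000)
```

Because the reward is normalized by length, a longer response does not gain an advantage merely by accumulating more tokens, which fits the goal of favoring concise trajectories.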

Paper Structure

This paper contains 33 sections, 14 equations, 4 figures, 4 tables, 1 algorithm.

Figures (4)

  • Figure 1: Illustration of DTO. We construct a truncated reasoning trajectory $\Delta_x^f$ by identifying the point where the probability score $p_i$ in Eq. \ref{eq:prob} exceeds a threshold $T=1.0$, and then applying the binary selection function $f(\cdot)$ from Eq. \ref{eq:ops_f}. We then append the finalization pattern $\delta_{\text{finalize}}$ and sampled answer $s^*$ (Eq. \ref{eq:finalize}) to form $\tilde{\Delta}_x^f$. Finally, the pruning function $g(\cdot)$ (Eq. \ref{eq:ops_g}) refines the trajectory into the optimized version $\Delta_x^g$, as illustrated in the orange box.
  • Figure 2: Comparison of dynamically optimized vs. original responses, and $\bm{\max_i p_i}$ distributions in incorrect cases. (a), (b) Dynamic optimization preserves accuracy for correct responses while reducing attention FLOPs (47%, 40%), and improves accuracy for incorrect ones (15.6%, 7.8%) with lower FLOPs. (c) shows $\max_i p_i$, the maximum estimated correctness probability across thinking patterns (Eq. \ref{eq:prob}). High values suggest that even incorrect trajectories often contain a promising intermediate segment.
  • Figure 3: The average count of "Wait", which is one of the words signaling a thinking pattern transition. Compared to all baselines, our framework generally results in the lowest average count of "Wait", suggesting more concise and less interrupted reasoning trajectories. The results are averaged over 4 runs.
  • Figure 4: Distribution of $\bm{\max_i p_i}$ in incorrect responses of DeepScaleR-1.5B-Preview.
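
The pipeline described in the Figure 1 caption can be sketched roughly as follows. This is only an illustration under stated assumptions, since the referenced equations are not reproduced here: the per-segment scores `probs` are taken as given, the selection function $f(\cdot)$ is approximated as "keep every segment up to the first one whose score reaches the threshold", and the pruning function $g(\cdot)$ is stood in for by a caller-supplied predicate `keep`. All function and parameter names are hypothetical. The comparison uses `>=` because a probability cannot strictly exceed 1.0.

```python
from typing import Callable, List, Optional, Sequence


def truncate_at_threshold(
    segments: Sequence[str], probs: Sequence[float], threshold: float = 1.0
) -> Optional[List[str]]:
    """Keep segments up to and including the first one whose score
    reaches the threshold; return None if no segment does."""
    for i, p in enumerate(probs):
        if p >= threshold:
            return list(segments[: i + 1])
    return None


def build_optimized_trajectory(
    segments: Sequence[str],
    probs: Sequence[float],
    finalize: str,
    answer: str,
    keep: Callable[[int, str], bool],
    threshold: float = 1.0,
) -> Optional[List[str]]:
    """Truncate, prune with the stand-in predicate, then close the
    trajectory with the finalization pattern and the answer."""
    truncated = truncate_at_threshold(segments, probs, threshold)
    if truncated is None:
        return None
    # Assumption: pruning only touches thinking segments, never the
    # finalization pattern or the answer.
    pruned = [s for i, s in enumerate(truncated) if keep(i, s)]
    return pruned + [finalize, answer]


# Toy example with made-up segments and scores.
segments = ["set up equation", "wait, recheck", "solve for x", "try another way"]
probs = [0.2, 0.5, 1.0, 1.0]
result = build_optimized_trajectory(
    segments, probs, "<finalize>", "x = 42",
    keep=lambda i, s: not s.startswith("wait"),
)
```

In the toy example the fourth segment is dropped by truncation and the second by the pruning predicate, leaving two thinking segments followed by the finalization pattern and the answer. The quantity $\max_i p_i$ plotted in Figures 2(c) and 4 corresponds to `max(probs)` in this notation.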