Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilities

Weixiang Zhao, Xingyu Sui, Jiahe Guo, Yulin Hu, Yang Deng, Yanyan Zhao, Xuda Zhi, Yongbo Huang, Hao He, Wanxiang Che, Ting Liu, Bing Qin

TL;DR

This work investigates the trade-offs of equipping large reasoning models with deliberative reasoning through distillation or reinforcement learning. Across DeepSeek, Qwen, and LLaMA families at 7B–32B scales, it shows that stronger deliberative reasoning markedly degrades foundational capabilities like helpfulness and safety while increasing inference costs. The authors demonstrate that adaptive reasoning modes—Zero-Thinking, Less-Thinking, and Summary-Thinking—can mitigate some of these drawbacks and improve performance on various general tasks. They argue for designing LRMs capable of dynamically allocating inference-time compute according to task characteristics to achieve balanced, versatile performance.

Abstract

Recent advancements in Large Reasoning Models (LRMs), such as OpenAI's o1/o3 and DeepSeek-R1, have demonstrated remarkable performance in specialized reasoning tasks through human-like deliberative thinking and long chain-of-thought reasoning. However, our systematic evaluation across various model families (DeepSeek, Qwen, and LLaMA) and scales (7B to 32B) reveals that acquiring these deliberative reasoning capabilities significantly reduces the foundational capabilities of LRMs, including notable declines in helpfulness and harmlessness, alongside substantially increased inference costs. Importantly, we demonstrate that adaptive reasoning -- employing modes like Zero-Thinking, Less-Thinking, and Summary-Thinking -- can effectively alleviate these drawbacks. Our empirical insights underline the critical need for developing more versatile LRMs capable of dynamically allocating inference-time compute according to specific task characteristics.

Paper Structure

This paper contains 34 sections, 10 figures, 11 tables.

Figures (10)

  • Figure 1: Comparison of the efficacy and efficiency of different LRMs and their chat-version LLM counterparts.
  • Figure 2: The thought and response lengths of various 32B-scale LRMs across benchmarks.
  • Figure 3: Performance analysis of LRMs under the Less-Thinking mode across multiple benchmarks. The x-axis denotes the Thinking Ratio, indicating the proportion of deliberate reasoning steps utilized during inference. (a) Results for the distilled LRM (s1.1-32B). (b) Results for the reinforcement learning-based LRM (QwQ-32B).
  • Figure 4: Detailed prompt for the safety evaluation on StrongReject.
  • Figure 5: Detailed prompt for the safety evaluation on WildJailbreak.
  • ...and 5 more figures