Exploring Efficiency Frontiers of Thinking Budget in Medical Reasoning: Scaling Laws between Computational Resources and Reasoning Quality

Ziqian Bi, Lu Chen, Junhao Song, Hongying Luo, Enze Ge, Junmin Huang, Tianyang Wang, Keyu Chen, Chia Xin Liang, Zihan Wei, Huafeng Liu, Chunjie Tian, Jibin Guan, Joe Yeong, Yongzhi Xu, Peng Wang, Xinyuan Song, Junfeng Hao

TL;DR

This work introduces thinking budget as an inference-time resource to optimize medical reasoning. By evaluating Qwen3 and DeepSeek-R1 across 15 diverse medical datasets, the authors uncover a logarithmic relationship between reasoning depth, model size, and accuracy, revealing three practical regimes for token budgets. They formalize budgeted reasoning with a scaling framework and validate cross-architecture generality via a truncation approach that preserves same-model reasoning content. The findings enable dynamic resource allocation in clinical AI with transparent, verifiable reasoning traces, offering actionable guidance for deployment and future adaptive inference strategies in high-stakes healthcare. Overall, thinking budget control emerges as a principled mechanism to balance accuracy, efficiency, and interpretability in medical AI systems.

Abstract

This study presents the first comprehensive evaluation of thinking budget mechanisms in medical reasoning tasks, revealing fundamental scaling laws between computational resources and reasoning quality. We systematically evaluated two major model families, Qwen3 (1.7B to 235B parameters) and DeepSeek-R1 (1.5B to 70B parameters), across 15 medical datasets spanning diverse specialties and difficulty levels. Through controlled experiments with thinking budgets ranging from zero to unlimited tokens, we establish logarithmic scaling relationships in which accuracy improvements follow a predictable pattern with both thinking budget and model size. Our findings identify three distinct efficiency regimes: high-efficiency (0 to 256 tokens), suitable for real-time applications; balanced (256 to 512 tokens), offering optimal cost-performance tradeoffs for routine clinical support; and high-accuracy (above 512 tokens), justified only for critical diagnostic tasks. Notably, smaller models benefit disproportionately from extended thinking, with improvements of 15 to 20% compared to 5 to 10% for larger models, suggesting a complementary relationship in which thinking budget provides greater relative benefits for capacity-constrained models. Domain-specific patterns emerge clearly, with neurology and gastroenterology requiring significantly deeper reasoning processes than cardiovascular or respiratory medicine. The consistency between Qwen3's native thinking budget API and our proposed truncation method for DeepSeek-R1 validates the generalizability of thinking budget concepts across architectures. These results establish thinking budget control as a critical mechanism for optimizing medical AI systems, enabling dynamic resource allocation aligned with clinical needs while maintaining the transparency essential for healthcare deployment.
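
The logarithmic form and the three regimes described in the abstract can be sketched in Python as follows. This is a minimal illustration, not the authors' code. The function names are hypothetical. The functional form is the one quoted in the Figure 4 caption. The coefficients are left to the caller because their units (fraction or percentage points) are not stated here. Assigning the shared boundary of 256 tokens to the high-efficiency regime is an assumption, since the abstract lists 256 in both ranges.

```python
import math


def efficiency_regime(budget_tokens: float) -> str:
    """Map a thinking budget (in tokens) to one of the three named regimes."""
    if budget_tokens < 0:
        raise ValueError("thinking budget must be non-negative")
    if budget_tokens <= 256:
        # Assumption: the boundary value 256 is treated as high-efficiency.
        return "high-efficiency"
    if budget_tokens <= 512:
        # "Above 512" is high-accuracy, so 512 itself stays in balanced.
        return "balanced"
    return "high-accuracy"


def predicted_accuracy(budget_tokens: float, model_size: float,
                       alpha: float, beta: float, gamma: float) -> float:
    """Evaluate Accuracy = alpha*ln(T_b + 1) + beta*ln(M_s) + gamma.

    alpha, beta, gamma are fitted coefficients supplied by the caller;
    no values are assumed here.
    """
    return (alpha * math.log(budget_tokens + 1)
            + beta * math.log(model_size)
            + gamma)


def marginal_gain(budget_from: float, budget_to: float, alpha: float) -> float:
    """Accuracy change from raising the budget at a fixed model size.

    The model-size and intercept terms cancel, so only alpha matters.
    """
    return alpha * (math.log(budget_to + 1) - math.log(budget_from + 1))


if __name__ == "__main__":
    for budget in (0, 64, 128, 256, 512, 1024):
        print(budget, efficiency_regime(budget))
    # Ratio of the gain from 0 -> 256 tokens to the gain from 256 -> 512.
    print(marginal_gain(0, 256, 1.0) / marginal_gain(256, 512, 1.0))
```

Under this form the budget enters through ln(T_b + 1), so the first 256 tokens contribute roughly eight times as much as the step from 256 to 512 tokens, whatever the value of alpha. That diminishing return is consistent with reserving budgets above 512 tokens for critical tasks.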

Paper Structure

This paper contains 29 sections, 14 equations, 16 figures, 3 tables.

Figures (16)

  • Figure 1: Overview of our three-stage thinking budget evaluation pipeline for medical reasoning tasks. Stage 1 (Unconstrained Thinking Generation): Both model families generate complete reasoning traces without budget constraints. Qwen3 produces full thinking processes via native API calls, while DeepSeek-R1 generates unrestricted reasoning sequences stored in a buffer for subsequent processing. Stage 2 (Thinking Budget Simulation): Two distinct control mechanisms are employed. For Qwen3, budget constraints are enforced directly through the native thinking_budget API parameter during generation. For DeepSeek-R1, which lacks native budget control, we implement a truncation-based approach: (i) extract thinking content between <think> and </think> tags from Stage 1 outputs, (ii) apply token-level truncation at predefined budgets $T_b \in \{0, 64, 128, 256, 512, 1024\}$, and (iii) preserve the inf condition as the original full reasoning trace. Stage 3 (Constrained Inference & Response): Budget-constrained prompts are reconstructed by combining the original medical query with truncated thinking content. These reconstructed prompts are then fed to the inference engine for final answer generation.
  • Figure 2: Illustration of the Thinking Model and Budget Mechanism. The top panel shows how medical queries are processed through Qwen3's thinking mode with controllable budget allocation. The bottom panel demonstrates how progressive thinking leads to better answers (left table) and visualizes the three efficiency regimes identified in our study (right chart).
  • Figure 3: Dataset difficulty ranking based on Qwen3:235B performance with unlimited thinking budget. Accuracy ranges from 88.5% (Attending Cardiovascular) to 59.5% (Chief Neurology), illustrating the varying complexity of medical reasoning tasks across specialties.
  • Figure 4: Logarithmic scaling of accuracy with thinking budget across model families. Scatter points show empirical results for six representative models (3 from each family), with fitted regression lines demonstrating the scaling law $\text{Accuracy} = \alpha \ln(T_b + 1) + \beta \ln(M_s) + \gamma$. Background shading indicates efficiency regimes: high efficiency (blue, 0-256 tokens), balanced (yellow, 256-512 tokens), and high accuracy (red, 512+ tokens). The 95% confidence interval (gray dotted lines) shows the consistency of the scaling relationship. Smaller models exhibit steeper slopes ($\alpha \approx 0.095$) compared to larger models ($\alpha \approx 0.08$), confirming that thinking budget provides greater relative benefits for capacity-constrained models.
  • Figure 5: Comprehensive results for Neurology datasets (most challenging). The top row shows performance curves across thinking budgets for both model families at the chief and attending levels. The bottom row compares model performance without thinking (None) versus unlimited thinking (inf), revealing substantial improvements from thinking processes, especially for smaller models.
  • ...and 11 more figures
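
The truncation-based budget simulation described in the Figure 1 caption (extract the content between <think> and </think>, truncate it at each budget, then rebuild the prompt) can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' implementation. Whitespace splitting stands in for the model's real tokenizer. The prompt template in build_prompt is hypothetical. The inf condition is represented by None. All function names are invented for this example.

```python
import re
from typing import Optional

# Budgets from the Figure 1 caption; None stands for the "inf" condition.
BUDGETS = [0, 64, 128, 256, 512, 1024, None]

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def extract_thinking(model_output: str) -> str:
    """Return the text between the first <think>...</think> pair, or ''."""
    match = THINK_RE.search(model_output)
    return match.group(1).strip() if match else ""


def truncate_thinking(thinking: str, budget: Optional[int]) -> str:
    """Keep the first `budget` tokens of the reasoning trace.

    Assumption: tokens are whitespace-separated words; a real pipeline
    would use the model's own tokenizer.
    """
    if budget is None:
        # "inf" condition: keep the original full reasoning trace.
        return thinking
    return " ".join(thinking.split()[:budget])


def build_prompt(query: str, thinking: str, budget: Optional[int]) -> str:
    """Combine the query with truncated thinking (hypothetical template)."""
    truncated = truncate_thinking(thinking, budget)
    return f"{query}\n<think>\n{truncated}\n</think>\n"


if __name__ == "__main__":
    stage1_output = "<think>step one then step two then conclude</think>Answer: B"
    trace = extract_thinking(stage1_output)
    for budget in BUDGETS:
        prompt = build_prompt("Which option is correct?", trace, budget)
        print(budget, repr(prompt))
```

Because every budget condition reuses the same Stage 1 trace, the truncated variants differ only in how much of the model's own reasoning is kept, which is the property the caption relies on when comparing budgets.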