Broken Chains: The Cost of Incomplete Reasoning in LLMs

Ian Su, Gaurav Purushothaman, Jey Narayan, Ruhika Goel, Kevin Zhu, Sunishchal Dev, Yash More, Maheep Chaudhary

TL;DR

This work analyzes how reasoning modality (code, natural language, hybrid, or none) interacts with token budgets to affect mathematical problem solving in four frontier LLMs. By constraining reasoning to specific modalities and ablating token budgets from 10% to 70% of the unconstrained optimum, the authors show that truncated CoT can actively degrade performance, that code-based reasoning often maintains higher accuracy under constraint, and that hybrid approaches incur inefficiencies. The study spans GSM8K, AIME, and HMMT and finds that robustness is model-dependent, with Grok generally weathering budget cuts better than the other models. The findings call for a careful choice of reasoning modality when deploying reasoning-heavy models in latency- or cost-constrained environments, since incomplete reasoning chains may do more harm than good in practice.

Abstract

Reasoning-specialized models like OpenAI's GPT-5.1 and DeepSeek-V3.2 allocate substantial inference compute to extended chain-of-thought (CoT) traces, yet reasoning tokens incur significant costs. How do different reasoning modalities (code, natural language, hybrid, or none) perform under token constraints? We introduce a framework that constrains models to reason exclusively through code, comments, both, or neither, then systematically ablates token budgets to 10%, 30%, 50%, and 70% of optimal. We evaluate four frontier models (GPT-5.1, Gemini 3 Flash, DeepSeek-V3.2, Grok 4.1) across mathematical benchmarks (AIME, GSM8K, HMMT). Our findings reveal: (1) truncated reasoning can hurt, as DeepSeek-V3.2 achieves 53% with no reasoning but only 17% with truncated CoT at 50% budget; (2) code degrades gracefully, as Gemini's comments collapse to 0% while code maintains 43-47%; (3) hybrid reasoning underperforms single modalities; (4) robustness is model-dependent, as Grok maintains 80-90% at 30% budget where OpenAI and DeepSeek collapse to 7-27%. These results suggest incomplete reasoning chains actively mislead models, with implications for deploying reasoning-specialized systems under resource constraints.
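
The budget ablation described in the abstract can be illustrated with a minimal sketch. The abstract does not say how tokens are counted or how a trace is cut, so the whitespace tokenization, the rounding rule, and the names `ablated_budgets` and `truncate_trace` below are all assumptions made for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a token-budget ablation.
# Assumptions (not from the paper): whitespace-separated words stand in
# for model tokens, and budgets are rounded down with a floor of 1.

ABLATION_PERCENTS = (10, 30, 50, 70)  # budget levels named in the abstract


def ablated_budgets(optimal_tokens: int, percents=ABLATION_PERCENTS) -> dict[int, int]:
    """Map each percentage to a token cap derived from the optimal count."""
    if optimal_tokens <= 0:
        raise ValueError("optimal_tokens must be positive")
    # Integer arithmetic avoids floating-point rounding surprises.
    return {p: max(1, optimal_tokens * p // 100) for p in percents}


def truncate_trace(trace: str, budget: int) -> str:
    """Keep only the first `budget` whitespace-separated tokens of a trace."""
    tokens = trace.split()
    return " ".join(tokens[:budget])


if __name__ == "__main__":
    budgets = ablated_budgets(1000)
    print(budgets)
    print(truncate_trace("First add 3 and 4 then multiply by 2", budgets[10] // 25))
```

In an actual experiment the cap would more likely be enforced through the model provider's own token limit than by cutting text after the fact; the sketch only shows how the four budget levels relate to the unconstrained token count.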

Paper Structure (33 sections, 4 figures)

This paper contains 33 sections, 4 figures.

Figures (4)

  • Figure 1: Reasoning robustness under token constraints. (a) Average accuracy across reasoning conditions as token budget varies from 10% to 70% of optimal. Grok maintains high performance even at severe constraints, while other models degrade substantially. (b) Comparing no explicit reasoning versus truncated CoT at 10% budget reveals a striking paradox: for Gemini, GPT-5.1, and DeepSeek, no reasoning outperforms truncated reasoning, suggesting incomplete chains actively mislead models. Only Grok benefits from truncated CoT.
  • Figure 2: Full performance across all models and datasets. Success rates by model-dataset pair for each reasoning condition (Code-only, Comments-only, Both, Nothing, CoT). Bars show exact-match accuracy under unconstrained generation and tokens.
  • Figure 3: Token-budget ablation for 10% of the per-setting optimal token count. The x-axis shows the models, and the y-axis shows the success rates given a proportion of token ablation.
  • Figure 4: Token-budget ablation for 10%, 30%, 50%, and 70% of the per-setting optimal token count. The x-axis shows the models, and the y-axis shows the success rates given a proportion of token ablation.
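
The figures report exact-match accuracy as the success rate. A minimal sketch of such a metric follows; the normalization steps (trimming whitespace, dropping thousands separators and a trailing period) and the function names are assumptions for illustration, since the page does not describe how answers are compared.

```python
# Hypothetical exact-match scorer; the normalization rules are assumptions.

def normalize(answer: str) -> str:
    """Trim whitespace, drop thousands separators and a trailing period."""
    return answer.strip().replace(",", "").rstrip(".")


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    preds = ["42", " 1,000 ", "7"]
    refs = ["42", "1000", "8"]
    print(f"{exact_match_accuracy(preds, refs):.2%}")
```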