Broken Chains: The Cost of Incomplete Reasoning in LLMs
Ian Su, Gaurav Purushothaman, Jey Narayan, Ruhika Goel, Kevin Zhu, Sunishchal Dev, Yash More, Maheep Chaudhary
TL;DR
This work analyzes how reasoning modality (code, natural language, hybrid, or none) interacts with token budgets to affect mathematical problem solving in four frontier LLMs. By constraining reasoning to specific modalities and ablating token budgets from $10\%$ to $70\%$ of the unconstrained optimum, the authors show that truncated CoT can actively degrade performance, that code-based reasoning often maintains higher accuracy under constraint, and that hybrid approaches incur inefficiencies. The study covers the GSM8K, AIME, and HMMT datasets and finds that robustness is model-dependent, with Grok generally weathering budget cuts better than the other models. The findings call for careful choice of reasoning modality when deploying reasoning-heavy models in latency- or cost-constrained environments, since incomplete reasoning chains may be more harmful than beneficial in practice.
Abstract
Reasoning-specialized models like OpenAI's GPT-5.1 and DeepSeek-V3.2 allocate substantial inference compute to extended chain-of-thought (CoT) traces, yet reasoning tokens incur significant costs. How do different reasoning modalities (code, natural language, hybrid, or none) perform under token constraints? We introduce a framework that constrains models to reason exclusively through code, comments, both, or neither, then systematically ablates token budgets to 10\%, 30\%, 50\%, and 70\% of the unconstrained optimum. We evaluate four frontier models (GPT-5.1, Gemini 3 Flash, DeepSeek-V3.2, Grok 4.1) across mathematical benchmarks (AIME, GSM8K, HMMT). Our findings reveal: (1) \textbf{truncated reasoning can hurt}, as DeepSeek-V3.2 achieves 53\% accuracy with no reasoning but only 17\% with truncated CoT at 50\% budget; (2) \textbf{code degrades gracefully}, as Gemini's comments collapse to 0\% while code maintains 43-47\%; (3) \textbf{hybrid reasoning underperforms} single modalities; (4) \textbf{robustness is model-dependent}, as Grok maintains 80-90\% at 30\% budget where OpenAI and DeepSeek collapse to 7-27\%. These results suggest incomplete reasoning chains actively mislead models, with implications for deploying reasoning-specialized systems under resource constraints.
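The ablation design described above can be illustrated with a minimal sketch that enumerates the experimental conditions. The modality names and budget percentages come from the abstract. The function names, the `optimal_tokens` input, and the integer rounding rule are hypothetical and are not taken from the authors' implementation, and the abstract does not say how the optimal budget is measured.

```python
from itertools import product

# Modalities and budget percentages as named in the abstract.
MODALITIES = ("code", "comments", "both", "neither")
BUDGET_PERCENTS = (10, 30, 50, 70)


def ablation_budgets(optimal_tokens, percents=BUDGET_PERCENTS):
    """Map each percentage to a token cap (hypothetical helper)."""
    if optimal_tokens <= 0:
        raise ValueError("optimal_tokens must be positive")
    # Assumption: floor to whole tokens and never go below one token.
    return {p: max(1, optimal_tokens * p // 100) for p in percents}


def experiment_grid(optimal_tokens):
    """List every (modality, percent, token_cap) condition."""
    budgets = ablation_budgets(optimal_tokens)
    return [(m, p, budgets[p]) for m, p in product(MODALITIES, BUDGET_PERCENTS)]
```

For an assumed unconstrained optimum of 1000 tokens, `experiment_grid(1000)` yields 16 conditions (four modalities times four budgets) with caps of 100, 300, 500, and 700 tokens.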
