Can Large Reasoning Models Improve Accuracy on Mathematical Tasks Using Flawed Thinking?
Saraswathy Amjith, Mihika Dusad, Neha Muramalla, Shweta Shah
TL;DR
This work tackles the brittleness of chain-of-thought reasoning in math by training models on deliberately flawed CoTs to cultivate error detection and recovery. It introduces a process-level framework that injects a single error at the first reasoning step, differentiates calculation versus reasoning errors, and optimizes with a binary final-answer reward using GRPO and LoRA-finetuned Qwen3-4B. Empirical results on MATH-lighteval show that mixed flawed-CoT training matches standard RL performance on clean problems while substantially improving robustness to adversarial or corrupted prompts, particularly when reasoning errors are involved. The findings suggest that exposure to flawed traces can yield more reliable mathematical reasoning without sacrificing accuracy, with broad implications for robust, interactive math education and reasoning tools.
Abstract
Chain-of-thought (CoT) prompting has become central to mathematical reasoning in large language models, yet models remain brittle to early errors: a single arithmetic slip or unjustified inference typically propagates uncorrected to an incorrect final answer. We investigate whether training on intentionally flawed reasoning traces can teach models to detect and recover from such errors without degrading standard problem-solving ability. Using competition-level problems from MATH-lighteval, we generate CoT prefixes containing exactly one controlled error, either a calculation error (sign flips, dropped terms) or a reasoning error (misapplied rules, unjustified logical steps), and fine-tune Qwen3-4B with GRPO using a binary final-answer reward. Our Mixed-CoT-RL model matches standard RL on clean problems (41% vs. 41%) while substantially outperforming it on problems prefilled with flawed reasoning (24% vs. 19%). Notably, clean-only RL fine-tuning degrades robustness below the untuned baseline (19% vs. 20%), indicating that conventional training increases susceptibility to misleading prefills. Among error types, training on reasoning errors yields greater robustness gains than training on calculation errors alone, with mixed training performing best. These findings demonstrate that exposure to flawed traces during training can improve error-recovery behavior without sacrificing accuracy, suggesting a path toward more robust mathematical reasoning in LLMs.
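To make the training signal concrete, the following is a minimal sketch of a binary final-answer reward and the group-relative advantage that GRPO derives from it. It is illustrative, not the authors' implementation: the function names (`extract_boxed`, `binary_reward`, `group_advantages`), the assumption that final answers appear inside `\boxed{...}`, and the use of exact string matching are all assumptions introduced here.

```python
from statistics import mean, pstdev


def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None.

    Assumption: the final answer is wrapped in \\boxed{}, with balanced braces.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for c in text[start + len(marker):]:
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(c)
    return None  # unbalanced braces: treat as no answer


def binary_reward(completion, reference):
    """1.0 if the extracted final answer matches the reference exactly, else 0.0.

    Only the final answer is scored; the flawed prefix and the intermediate
    steps receive no direct reward. Exact string matching is a simplification.
    """
    answer = extract_boxed(completion)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), the
    group-relative baseline used by GRPO in place of a learned critic.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical completions sampled after a flawed CoT prefix.
    completions = [
        r"Wait, the sign above is wrong. Redoing it gives \boxed{\frac{1}{2}}",
        r"Continuing from the step above, we get \boxed{-\frac{1}{2}}",
        r"So the answer is \boxed{2}",
        r"I am not sure how to finish this.",
    ]
    rewards = [binary_reward(c, r"\frac{1}{2}") for c in completions]
    print(rewards)
    print(group_advantages(rewards))
```

Under this sketch, a completion that recovers from the injected error and reaches the correct answer is the only one in its group with a positive advantage, while a group whose samples all fail (or all succeed) contributes no gradient signal because every advantage is zero.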
