MixReasoning: Switching Modes to Think

Haiquan Lu, Gongfan Fang, Xinyin Ma, Qi Li, Xinchao Wang

TL;DR

MixReasoning addresses the inefficiency of uniformly long chain-of-thought reasoning by adaptively adjusting reasoning depth within a single model response. It uses a lightweight LoRA adapter to switch between concise and detailed thinking, expanding into detailed mode at locally uncertain points identified by token-level entropy. It does not require retraining the base model or coordinating multiple models. Empirical results on GSM8K, MATH-500, and AIME show shorter reasoning traces and improved efficiency while maintaining or improving accuracy, with a budget that can be controlled through the window size and the uncertainty threshold. The approach preserves KV-cache reuse and base-model capabilities, offering a practical, memory-friendly path to more readable and cost-efficient reasoning in interactive settings.
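As a rough illustration of the token-level entropy signal mentioned above, the following minimal sketch computes the Shannon entropy of one decoding step's next-token distribution and compares it with a threshold. The function names `token_entropy` and `is_uncertain`, the use of natural-log entropy, and the threshold value are assumptions made for illustration; the paper's exact uncertainty measure may differ.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one step's logits."""
    m = max(logits)  # subtract the max logit for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Skip zero-probability entries, which contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def is_uncertain(logits, threshold):
    """Hypothetical trigger: flag a decoding step whose entropy exceeds the threshold."""
    return token_entropy(logits) > threshold


# A flat distribution is maximally uncertain; a peaked one is not.
flat = token_entropy([0.0, 0.0, 0.0, 0.0])
peaked = token_entropy([100.0, 0.0, 0.0])
```

Under this sketch, a step with several equally likely continuations would trigger detailed reasoning, while a step with one dominant token would stay in concise mode.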

Abstract

Reasoning models enhance performance by tackling problems in a step-by-step manner, decomposing them into sub-problems and exploring long chains of thought before producing an answer. However, applying extended reasoning to every step introduces substantial redundancy, as sub-problems vary widely in difficulty and complexity: a small number of pivotal steps are genuinely challenging and decisive for the final answer, while many others only involve straightforward revisions or simple computations. Therefore, a natural idea is to endow reasoning models with the ability to adaptively respond to this variation, rather than treating all steps with the same level of elaboration. To this end, we propose MixReasoning, a framework that dynamically adjusts the depth of reasoning within a single response. The resulting chain of thought then becomes a mixture of detailed reasoning on difficult steps and concise inference on simpler ones. Experiments on GSM8K, MATH-500, and AIME show that MixReasoning shortens reasoning length and substantially improves efficiency without compromising accuracy.

Paper Structure

This paper contains 24 sections, 4 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: The comparison among Long-to-short compression, Hybrid reasoning, and MixReasoning. MixReasoning dynamically adjusts the depth of reasoning within a single response. The resulting chain of thought then becomes a mixture of detailed reasoning on difficult steps and concise inference on simpler ones.
  • Figure 2: MixReasoning uses a single base model served together with a concise LoRA; during decoding we modulate the adapter strength to switch between short-form and long-form reasoning. When token-level uncertainty exceeds a threshold, we expand a local uncertain window and regenerate it in long-form mode; once confidence recovers, adapter strength is annealed back and decoding proceeds in the concise mode.
  • Figure 3: Results of MixReasoning and long-to-short reasoning baselines (prompting [han2024token]; fine-tuning with CoT-Valve [ma2025cot]) on the GSM8K dataset with QwQ-32B-Preview, Qwen3-14B, and Qwen3-8B at varying token budgets. MixReasoning achieves a better trade-off between efficiency and accuracy.
  • Figure 4: Qualitative comparison: Long CoT produces a verbose trace with coherence fillers and redundant self-checks. MixReasoning (small window) expands only at the high-uncertainty fork and then anneals back to concise mode, reaching the correct answer with a substantially shorter trace. MixReasoning (large window) allocates more detailed reasoning across adjacent steps, trading a larger budget for additional rationale while staying focused around the pivotal region. In MixReasoning responses, thinking-mode tokens are highlighted in red and non-thinking-mode tokens are highlighted in blue.
  • Figure 5: Layerwise LoRA ablation for reasoning-chain compression. Fine-tuning only MLP layers achieves token-length compression comparable to fine-tuning all layers, despite similar training convergence across configurations. In contrast, attention K/V–only adapters provide little compression, suggesting that knowledge governing reasoning-path length resides primarily in MLPs.
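
The switching rule described in the Figure 2 caption can be sketched offline as follows. This is a simplified illustration, not the authors' implementation: it takes a precomputed list of per-token entropies, marks a local window around each token whose entropy exceeds the threshold, and assigns a per-token adapter strength. The names `uncertain_windows` and `adapter_strengths`, the `back`/`forward` window parameters, and the hard 1.0/0.0 switch (instead of gradual annealing and online regeneration during decoding) are all assumptions made for illustration.

```python
def uncertain_windows(entropies, threshold, back, forward):
    """Return merged half-open (start, end) spans around tokens with entropy > threshold."""
    spans = []
    n = len(entropies)
    for i, h in enumerate(entropies):
        if h <= threshold:
            continue
        start = max(0, i - back)
        end = min(n, i + forward + 1)
        if spans and start <= spans[-1][1]:
            # Overlapping or touching windows are merged into one span.
            spans[-1] = (spans[-1][0], max(spans[-1][1], end))
        else:
            spans.append((start, end))
    return spans


def adapter_strengths(n_tokens, spans, concise=1.0, detailed=0.0):
    """Per-token strength of the concise LoRA: full outside the spans, off inside them."""
    strengths = [concise] * n_tokens
    for start, end in spans:
        for t in range(start, end):
            strengths[t] = detailed
    return strengths


# Hypothetical entropy trace with two uncertain tokens (positions 2 and 6).
entropies = [0.1, 0.2, 2.0, 0.1, 0.1, 0.1, 1.5, 0.1]
small = uncertain_windows(entropies, threshold=1.0, back=1, forward=1)
large = uncertain_windows(entropies, threshold=1.0, back=2, forward=1)
schedule = adapter_strengths(len(entropies), small)
```

In this toy trace the smaller window yields two separate detailed spans, while the larger window merges them into one, which mirrors the budget trade-off between the small-window and large-window settings described for Figure 4.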