Gold-Switch: Training-Free Superposition of Slow- and Fast- Thinking LLMs

Jaeseong Lee, Dayoung Kwon, seung-won hwang

TL;DR

Gold-Switch targets overthinking in large reasoning models by introducing a training-free, low-rank unlearning module $L$ that, added to the slow-thinking LRM weights $W_R$, creates a controllable inference regime alongside a fast-thinking LLM. The method selects a per-layer rank $r$ via an energy-based criterion to approximate the overthinking component of $\nabla W$ with a low-rank $\boldsymbol{L}$, ensuring essential reasoning is preserved while reducing computation. It supports hard and soft superposition, with hard switching showing stronger practical gains, and demonstrates notable speedups (up to $2.7\times$) and memory reductions (up to $9\times$) on GSM8K, ASDIV, AIME, and GPQA across multiple baselines. Entropy-based rank selection outperforms fixed-ratio approaches, while soft-switching generally underperforms relative to hard-switching. The approach is training-free, deployment-friendly, and reduces reliance on dual-model routing, offering a practical path to efficient LRMs without retraining.

Abstract

Large Reasoning Models (LRMs) excel in structured tasks by emulating deliberate human reasoning but often suffer from overthinking, degrading performance and wasting resources. One possible baseline is to deploy both LLM and LRM, then route input by predicting whether it requires reasoning and may cause overthinking. However, deploying multiple models can be costly or impractical. We propose a superposed deployment strategy with a lightweight, training-free regulation to optimize inference by switching one model on and off. Instead of routing, we selectively unlearn from LRM at inference, scaling down computation while preserving reasoning. By analyzing the cumulative energy of singular values, we identify optimal low-rank projections to adjust reasoning just right.
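The rank selection described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and it rests on several assumptions that this page does not confirm: that the matrix being decomposed is the per-layer difference between the reasoning model's weights and a base model's weights, that "energy" means squared singular values, that the unlearning module is the negated rank-$r$ truncation of that difference, and that the threshold is 0.9. The function names `select_rank_by_energy` and `low_rank_unlearning_module` and the parameter `energy_threshold` are hypothetical.

```python
import numpy as np


def select_rank_by_energy(singular_values, energy_threshold=0.9):
    """Return the smallest rank r whose top-r singular values hold at least
    `energy_threshold` of the total squared-singular-value energy.

    Assumption: energy is the squared singular value; the threshold value
    is illustrative and not taken from the paper.
    """
    energy = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(energy) / np.sum(energy)
    # First index at which the cumulative energy reaches the threshold.
    first_idx = int(np.searchsorted(cumulative, energy_threshold))
    # Convert the index to a rank and clamp to the number of singular values.
    return min(first_idx + 1, len(energy))


def low_rank_unlearning_module(w_reasoning, w_base, energy_threshold=0.9):
    """Build a low-rank module L so that w_reasoning + L removes the dominant
    directions of the weight difference.

    Assumption: the difference is w_reasoning - w_base and L is the negated
    rank-r truncation of it; both choices are hypothetical.
    """
    delta = w_reasoning - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    r = select_rank_by_energy(s, energy_threshold)
    # Negated rank-r reconstruction from the top-r singular triplets.
    low_rank = -(u[:, :r] * s[:r]) @ vt[:r, :]
    return low_rank, r


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_base = rng.normal(size=(8, 8))
    w_reasoning = w_base + rng.normal(size=(8, 2)) @ rng.normal(size=(2, 8))
    module, rank = low_rank_unlearning_module(w_reasoning, w_base)
    print("selected rank:", rank)
    print("module shape:", module.shape)
```

Under these assumptions, a hard switch would amount to running a layer with either `w_reasoning` or `w_reasoning + module`, so only the small factors of the module need to be stored alongside the single deployed model.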

Paper Structure

This paper contains 39 sections, 11 equations, 7 figures, 12 tables.

Figures (7)

  • Figure 1: Overview of Gold-Switch. Gold-Switch extracts the overthinking parameters from the reasoning-enhanced model using low-rank approximation. It then dynamically adjusts the reasoning capabilities based on the difficulty of the input, allowing for efficient reasoning without retraining.
  • Figure 2: Average token length of GSM8K and ASDIV on QwQ
  • Figure 3: Average token length ratio of ASDIV on DeepSeek-R1-Distill-Qwen-32B
  • Figure 4: Case study of QwQ
  • Figure 5: Case study of QwQ+Nothinking
  • ...and 2 more figures