Gold-Switch: Training-Free Superposition of Slow- and Fast- Thinking LLMs
Jaeseong Lee, Dayoung Kwon, seung-won hwang
TL;DR
Gold-Switch targets overthinking in large reasoning models by introducing a training-free, low-rank unlearning module $L$ that, added to the slow-thinking LRM weights $W_R$, creates a controllable inference regime alongside a fast-thinking LLM. The method selects a per-layer rank $r$ via an energy-based criterion to approximate the overthinking component of $\nabla W$ with a low-rank $\boldsymbol{L}$, ensuring essential reasoning is preserved while reducing computation. It supports hard and soft superposition, with hard switching showing stronger practical gains, and demonstrates notable speedups (up to $2.7\times$) and memory reductions (up to $9\times$) on GSM8K, ASDIV, AIME, and GPQA across multiple baselines. Entropy-based rank selection outperforms fixed-ratio approaches, while soft-switching generally underperforms relative to hard-switching. The approach is training-free, deployment-friendly, and reduces reliance on dual-model routing, offering a practical path to efficient LRMs without retraining.
Abstract
Large Reasoning Models (LRMs) excel in structured tasks by emulating deliberate human reasoning but often suffer from overthinking, degrading performance and wasting resources. One possible baseline is to deploy both LLM and LRM, then route input by predicting whether it requires reasoning and may cause overthinking. However, deploying multiple models can be costly or impractical. We propose a superposed deployment strategy with a lightweight, training-free regulation to optimize inference by switching one model on and off. Instead of routing, we selectively unlearn from LRM at inference, scaling down computation while preserving reasoning. By analyzing the cumulative energy of singular values, we identify optimal low-rank projections to adjust reasoning just right.
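The cumulative-energy idea can be illustrated with a minimal sketch. It is not the authors' implementation and rests on assumptions: "energy" is taken to mean squared singular values, the threshold of 0.9 is a hypothetical placeholder, and `delta_w` is an illustrative stand-in for the per-layer matrix being approximated. The function names are likewise invented for this example.

```python
import numpy as np


def select_rank_by_energy(singular_values, threshold=0.9):
    """Return the smallest rank r whose top-r singular values hold at
    least `threshold` of the total energy.

    Assumption: energy is the squared singular value; the threshold
    default is a placeholder, not a value taken from the paper.
    """
    energy = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    # First index where the cumulative share reaches the threshold.
    r = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against floating-point round-off when threshold is near 1.
    return min(r, len(energy))


def low_rank_approx(delta_w, threshold=0.9):
    """Truncated-SVD approximation of `delta_w` at the selected rank."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    r = select_rank_by_energy(s, threshold)
    # Scale the first r left singular vectors, then project back.
    return (u[:, :r] * s[:r]) @ vt[:r, :], r


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    delta_w = rng.standard_normal((64, 32))  # hypothetical layer matrix
    approx, rank = low_rank_approx(delta_w, threshold=0.9)
    print("selected rank:", rank, "shape:", approx.shape)
```

Because the rank is chosen per matrix, layers whose spectrum is concentrated in a few directions receive a small rank, while layers with a flatter spectrum keep more components; a fixed-ratio rule would treat both the same.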
