Logit Arithmetic Elicits Long Reasoning Capabilities Without Training

Yunxiang Zhang, Muhammad Khalifa, Lechen Zhang, Xin Liu, Ayoung Lee, Xinliang Frederick Zhang, Farima Fatahi Bayat, Lu Wang

TL;DR

This work introduces ThinkLogit, a decoding-time, training-free framework that elicits long chain-of-thought (CoT) reasoning in a frozen large model by applying logit arithmetic with a small reasoning model as the guider. A stronger variant, ThinkLogit-DPO, trains the guider with Direct Preference Optimization (DPO) on preference pairs drawn from both the target and the guider, so that the guider's long-CoT guidance better aligns with the target's correctness. Across five reasoning benchmarks, ThinkLogit and ThinkLogit-DPO improve average accuracy over a frozen 32B target by 24.5% and 29.1% (relative), respectively. ThinkLogit-DPO requires fine-tuning only a small 78M-parameter adapter, and neither variant changes the target's weights. The approach also generalizes across model families and can emulate the effects of reinforcement learning by using RL-trained guiders, offering a practical, scalable alternative to expensive post-training for enabling long reasoning in large-scale models. These results show that inference-time guidance is a viable and flexible way to enhance reasoning without full-model fine-tuning.

Abstract

Large reasoning models exhibit long chain-of-thought (CoT) reasoning with strategies such as backtracking and self-correction, though recent studies suggest that these abilities typically require additional training. We first investigate whether such behaviors can be elicited without any training. To this end, we propose ThinkLogit, a decoding-time approach that uses logit arithmetic to steer a large non-reasoning target model toward long reasoning, with a substantially smaller reasoning model as the guider. We then show that performance can be boosted further by training the guider with preference optimization over correct/incorrect reasoning pairs sampled from both the target and the guider, a setup we refer to as ThinkLogit-DPO. In our experiments, ThinkLogit and ThinkLogit-DPO achieve relative improvements in average accuracy of 24.5% and 29.1%, respectively, across five reasoning benchmarks, using Qwen2.5-32B as the target guided by R1-Distill-Qwen-1.5B, a model 21x smaller. Moreover, we find that ThinkLogit remains effective when the guider and target come from different model families. It is also orthogonal to post-training methods for small models: guiders improved through supervised distillation or reinforcement learning can be plugged in directly to yield stronger large models, offering a practical path to unlock long reasoning in large-scale models without costly post-training.
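
The abstract does not spell out the logit arithmetic, so the following is only a minimal sketch under stated assumptions: that the guidance takes a proxy-tuning style form, in which the target's next-token logits are shifted by the difference between the small reasoning guider and a small non-reasoning counterpart, scaled by a strength `alpha`. The function names, the `guider_base` model, `alpha`, and the toy four-token vocabulary are all hypothetical, and the sketch assumes the three models share a vocabulary.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_logits(target, guider, guider_base, alpha=1.0):
    """Assumed fusion rule (not confirmed by the source):
    fused = target + alpha * (guider - guider_base).
    All three inputs are per-token logits over the same vocabulary."""
    return [t + alpha * (g - b) for t, g, b in zip(target, guider, guider_base)]

# Toy next-token logits over a 4-token vocabulary (illustrative values only).
target = [2.0, 1.0, 0.5, 0.0]        # frozen large model
guider = [0.0, 0.5, 3.0, 0.0]        # small reasoning model
guider_base = [0.0, 0.5, 0.5, 0.0]   # small non-reasoning counterpart (assumed)

fused = fuse_logits(target, guider, guider_base, alpha=1.0)
probs = softmax(fused)
greedy_token = max(range(len(probs)), key=probs.__getitem__)
```

In this toy case the target alone would pick token 0, while the fused distribution picks token 2, the token that the guider prefers more strongly than its non-reasoning counterpart does. Setting `alpha=0.0` recovers the target's own logits, which is one way to see that the target's weights are never modified.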

Paper Structure

This paper contains 34 sections, 2 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Overview of our proposed ThinkLogit and ThinkLogit-DPO approaches to elicit long chain-of-thought reasoning from a large non-reasoning model that is frozen.
  • Figure 2: Comparison of ThinkLogit against two training-free long CoT elicitation baselines: budget forcing and one-shot long CoT in-context learning (ICL). While these approaches increase verbosity, their accuracies are generally lower and can even degrade, whereas ThinkLogit consistently produces longer reasoning that delivers the best performance.
  • Figure 3: Avg@8 for eliciting long CoT in a 72B target model with our methods. ThinkLogit-DPO delivers larger performance improvements on AIME2025 and AMC23 compared to ThinkLogit, demonstrating that preference signals learned on a 32B model transfer effectively to a larger 72B model in the same family.
  • Figure 4: Inference-time scaling on AIME2025. Pass@$k$ for $k = 1\text{--}16$, comparing the target, guider, their direct logit fusion (ThinkLogit), and the DPO-aligned fusion (ThinkLogit-DPO). Our methods demonstrate superior sample efficiency, reaching stronger performance with fewer generations and maintaining larger gains as the sample budget increases.
  • Figure 5: Impact of warm-up $T$ on ThinkLogit: early guidance ($T{=}0$) lowers accuracy and causes over-long, repetitive outputs, while moderate warm-up ($T{=}100$) gives the best performance with coherent long CoTs.
  • ...and 1 more figure
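
The Figure 5 caption mentions a warm-up $T$ but this page does not define it. One plausible reading, consistent with $T{=}0$ being described as "early guidance", is that the target decodes its first $T$ tokens unguided and the guider's offset is applied only afterwards. The sketch below illustrates that reading; the function name, the `step` and `warmup` parameters, the `guider_base` logits, and the fusion rule itself are assumptions, not the authors' implementation.

```python
def guided_logits(step, target, guider, guider_base, alpha=1.0, warmup=100):
    """Return next-token logits for decoding position `step` (0-indexed).

    Assumed behavior: for the first `warmup` generated tokens the target's
    logits are used unchanged; from position `warmup` onward they are shifted
    by alpha * (guider - guider_base).
    """
    if step < warmup:
        return list(target)  # warm-up phase: no guidance applied
    return [t + alpha * (g - b) for t, g, b in zip(target, guider, guider_base)]

# Toy logits over a 3-token vocabulary (illustrative values only).
target = [1.0, 0.0, -1.0]
guider = [0.0, 2.0, 0.0]
guider_base = [0.0, 0.5, 0.0]

during_warmup = guided_logits(0, target, guider, guider_base, warmup=100)
after_warmup = guided_logits(100, target, guider, guider_base, warmup=100)
no_warmup = guided_logits(0, target, guider, guider_base, warmup=0)
```

Under this reading, `warmup=0` applies guidance from the very first token, which is the setting the caption associates with lower accuracy and over-long, repetitive outputs.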