Logit-Entropy Adaptive Stopping Heuristic for Efficient Chain-of-Thought Reasoning
Mohammad Atif Quamar, Mohammad Areeb
TL;DR
LEASH addresses the high computational cost of chain-of-thought reasoning with a training-free decoding heuristic that adaptively stops rationale generation. It monitors two intrinsic signals derived from the logits, the local entropy slope $s_H$ and the top-logit margin improvement $\Delta M$, and halts once both plateau after a minimum rationale length, then produces a concise final answer. Across four instruction-tuned LLMs on GSM8K and AQuA-RAT, LEASH reduces average token generation by 30–35% and latency by 27%, at the cost of a 10 p.p. accuracy drop relative to CoT. It is model-agnostic, works with standard decoding, and needs no architectural changes or extra supervision, which makes it a practical option for deploying reasoning-enabled LLMs under budget and latency constraints.
Abstract
Chain-of-Thought (CoT) prompting is a key technique for enabling complex reasoning in large language models. However, generating full, fixed-length rationales is computationally wasteful, inflating both token usage and latency. We introduce LEASH (Logit-Entropy Adaptive Stopping Heuristic), a training-free decoding algorithm that adaptively halts rationale generation. LEASH monitors two intrinsic signals: the slope of token-level entropy and the improvement in the top-logit margin. It terminates generation once both signals plateau, indicating that the model has reached a stable reasoning state. Across four instruction-tuned models on the GSM8K and AQuA-RAT benchmarks, LEASH reduces average token generation by 30–35% and latency by 27%, while incurring a 10 p.p. accuracy drop relative to CoT. LEASH is model-agnostic and requires no additional training or supervision, offering a simple and efficient alternative to CoT decoding.
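To make the stopping rule concrete, here is a minimal sketch of one possible reading of it, not the authors' implementation. It assumes the entropy slope is a least-squares slope over a short window of recent tokens, that the margin improvement is the change in top-1 minus top-2 logit gap over that same window, and that "plateau" means both fall within small tolerances. The function names, the window size, the minimum length, and the thresholds `eps_h` and `eps_m` are all hypothetical values chosen for illustration.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits), computed stably."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def top_logit_margin(logits):
    """Gap between the largest and second-largest logit."""
    top1, top2 = sorted(logits, reverse=True)[:2]
    return top1 - top2


def ls_slope(values):
    """Least-squares slope of values against their index 0..n-1 (needs n >= 2)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def should_stop(entropies, margins, min_len=8, window=4, eps_h=0.01, eps_m=0.05):
    """Return True when both signals look flat after a minimum rationale length.

    entropies, margins: per-token histories for the rationale generated so far.
    All defaults are illustrative assumptions, not values from the paper.
    """
    t = len(entropies)
    if t < max(min_len, window + 1):
        return False  # too early: enforce the minimum rationale length
    s_h = ls_slope(entropies[-window:])          # assumed form of s_H
    delta_m = margins[-1] - margins[-1 - window]  # assumed form of Delta M
    return abs(s_h) <= eps_h and delta_m <= eps_m
```

In a decoding loop, one would append `token_entropy(logits)` and `top_logit_margin(logits)` at each step and, once `should_stop` returns true, end the rationale and prompt for the final answer. The tolerances would need tuning per model and task.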
