Explore Briefly, Then Decide: Mitigating LLM Overthinking via Cumulative Entropy Regulation

Tianyi Jiang, Yi Bin, Yujuan Ding, Kainian Zhu, Fei Ma, Jingkuan Song, Heng Tao Shen

TL;DR

This work tackles LLM overthinking by introducing Token Entropy Cumulative Average (TECA) as a metric to quantify cumulative exploration during reasoning and proposing Explore Briefly, Then Decide as a thinking paradigm. TECA-informed Cumulative Entropy Regulation (CER) is integrated into a GRPO-based reinforcement learning framework to selectively suppress excessive exploration while preserving necessary exploration, using a TECA-based reward $r_{te} = e^{-\mathrm{TECA}_{-1}} + 1$ and a segmented reward that activates only for correct answers. Empirical results on GSM8K, MATH500, and related benchmarks show substantial reductions in response length (up to 71%) with minimal or no loss in accuracy, outperforming prior methods like CoD and CCoT. The findings demonstrate that TECA can guide adaptive reasoning depth, enabling efficient, human-like exploration followed by decisive conclusions, with potential for broader applicability in complex problem solving and resource-constrained deployments.
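To make the reward above concrete, here is a minimal sketch of how TECA and the TECA-based reward might be computed from per-token entropies. It assumes, going only by the metric's name, that TECA at step $t$ is the running mean of the token entropies up to $t$, and that $\mathrm{TECA}_{-1}$ is its value at the final token. The function names and the zero reward for incorrect answers are hypothetical illustrations, not the paper's implementation.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def teca_curve(entropies):
    """Running mean of per-token entropies (assumed TECA definition).

    TECA_t = (1 / t) * sum of H_i for i = 1..t
    """
    curve = []
    total = 0.0
    for t, h in enumerate(entropies, start=1):
        total += h
        curve.append(total / t)
    return curve


def teca_reward(entropies):
    """r_te = exp(-TECA_{-1}) + 1, with TECA_{-1} taken at the last token."""
    if not entropies:
        raise ValueError("need at least one token entropy")
    return math.exp(-teca_curve(entropies)[-1]) + 1.0


def segmented_reward(entropies, is_correct):
    """Hypothetical gating: the TECA reward applies only to correct answers.

    The value returned for incorrect answers (0.0) is an assumption.
    """
    return teca_reward(entropies) if is_correct else 0.0
```

Because entropy is non-negative, this reward lies in the interval $(1, 2]$ and is largest when the response ends with low cumulative entropy, which matches the stated goal of suppressing excessive exploration only once the answer is correct.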

Abstract

Large Language Models (LLMs) have demonstrated remarkable reasoning abilities on complex problems using long Chain-of-Thought (CoT) reasoning. However, they often suffer from overthinking, that is, generating unnecessarily lengthy reasoning steps for simpler problems. This issue may degrade the efficiency of the models and make it difficult for them to adapt their reasoning depth to the complexity of the problem. To address this, we introduce a novel metric, Token Entropy Cumulative Average (TECA), which measures the extent of exploration throughout the reasoning process. We further propose a novel reasoning paradigm -- Explore Briefly, Then Decide -- with an associated Cumulative Entropy Regulation (CER) mechanism. This paradigm leverages TECA to help the model dynamically determine the optimal point to conclude its thought process and provide a final answer, thus achieving efficient reasoning. Experimental results across diverse mathematical benchmarks show that our approach substantially mitigates overthinking without sacrificing problem-solving ability. With our thinking paradigm, the average response length decreases by up to 71% on simpler datasets, demonstrating the effectiveness of our method in creating a more efficient and adaptive reasoning process.

Paper Structure

This paper contains 25 sections, 8 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Reasoning process comparison between the original and CER-trained large reasoning models. The original model reflects four more times after the correct answer appears, while the CER-trained model settles on the final answer after only one reflection. GREEN: correct answers (first and last shown); RED: reflecting words.
  • Figure 2: Token Entropy Cumulative Average curves during inference for four compared models: Llama3.2-3B, Qwen2.5-14B, and Qwen3-8B (without and with thinking). (a) and (b) correspond to one test sample; (c) and (d) show the average results over 1000 samples, all from the GSM8K dataset. The yellow star marks the step where the correct answer first appears.
  • Figure 3: Inference token length curves during CER training on two LLMs. Left: Length Clip Ratio, the proportion of responses exceeding the model's maximum length; Middle: Average Length of all responses in a group; Right: Minimum Length of all responses in a group.
  • Figure 4: Token Entropy and TECA curves during inference for the reasoning models without and with our CER training. Left: average results over 1000 samples; Right: results for a single case.
  • Figure 5: A case comparing the responses generated by the original LRM and the CER-trained model. RED: reflecting words; GREEN: correct answer (first and last shown).
  • ...and 2 more figures