ThinkBrake: Mitigating Overthinking in Tool Reasoning

Minjae Oh, Sangjun Song, Seungkyu Lee, Sungmin Jo, Yohan Jo

TL;DR

The paper addresses overthinking in small reasoning models (SRMs) during tool use: SRMs often reach the correct tool–argument configuration but keep reasoning and overwrite it with an incorrect call. Using oracle rollouts that insert </think> at sentence boundaries, the authors show recoverable headroom on BFCL, where well-timed termination lifts average accuracy from 85.8% to 94.2% while cutting tokens by 80–94%. They introduce ThinkBrake, a training-free decoding heuristic that ends thinking when the log-probability margin between </think> and the current top token becomes small at a sentence boundary. Across BFCL splits and model scales, ThinkBrake preserves or improves accuracy while reducing token usage by up to 25%.

Abstract

Small reasoning models (SRMs) often overthink during tool use: they reach a correct tool–argument configuration, then continue reasoning and overwrite it with an incorrect final call. We diagnose overthinking via oracle rollouts that inject </think> at sentence boundaries. On the Berkeley Function Calling Leaderboard (BFCL), this oracle termination lifts average accuracy from 85.8% to 94.2% while reducing tokens by 80–94%, revealing substantial recoverable headroom and potentially redundant reasoning. While prior work on concise reasoning has largely targeted mathematics, tool reasoning remains underexplored. We adapt various early-termination baselines to tool use and introduce ThinkBrake, a training-free decoding heuristic. ThinkBrake monitors the log-probability margin between </think> and the current top token at sentence boundaries and triggers termination when this margin becomes small. Across BFCL's single-turn non-live and live splits, ThinkBrake preserves or improves accuracy while reducing tokens by up to 25%, outperforming various baselines.
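
The termination rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the punctuation-based sentence-boundary test, and the threshold value are all assumptions made for the example, and the next-token log-probabilities are passed in as a plain dictionary rather than read from a real model.

```python
import math

END_THINK = "</think>"  # token that closes the reasoning block


def log_margin(logprobs, end_token=END_THINK):
    """Log-probability gap between the current top token and the end token.

    `logprobs` maps candidate next tokens to log-probabilities.
    A missing end token is treated as probability zero (infinite margin).
    """
    top = max(logprobs.values())
    return top - logprobs.get(end_token, -math.inf)


def at_sentence_boundary(text):
    """Hypothetical boundary test: the text so far ends in ., ! or ?"""
    return text.rstrip().endswith((".", "!", "?"))


def should_brake(text_so_far, logprobs, threshold):
    """Return True when thinking should stop by emitting </think>.

    The check runs only at sentence boundaries and fires when the
    margin is at most `threshold` (an assumed hyperparameter).
    """
    if not at_sentence_boundary(text_so_far):
        return False
    return log_margin(logprobs) <= threshold


# Toy next-token distributions, given as probabilities for readability.
close_call = {"So": math.log(0.5), END_THINK: math.log(0.4), "But": math.log(0.1)}
far_apart = {"So": math.log(0.9), END_THINK: math.log(0.01), "But": math.log(0.09)}

stop_now = should_brake("The call is ready.", close_call, threshold=0.25)
keep_going = should_brake("The call is ready.", far_apart, threshold=0.25)
mid_sentence = should_brake("The call is", close_call, threshold=0.25)
```

In a real decoding loop, a check like `should_brake` would run after each generated sentence, and a positive result would force </think> as the next token. When </think> is already the top token, the margin is zero and the rule fires for any non-negative threshold.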

Paper Structure

This paper contains 15 sections, 2 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Overthinking in tool use. The SRM first reaches the correct tool–argument configuration (roomType="suite with queen size bed"), but continued reasoning over-interprets the request (roomType="suite"), yielding a wrong call. Halting at that moment by injecting </think> preserves the correct call.
  • Figure 2: Proportion of overthinking vs. other errors across 119 incorrect cases over four categories in the BFCL single-turn non-live split.
  • Figure 4: Log-probability plot with two illustrative cases, $\Delta P_1$ and $\Delta P_2$. Here, $\Delta P_1$ denotes a scenario where the most likely token has high probability (0.8), and $\Delta P_2$ denotes a scenario where the most likely token has lower probability (0.4), while the raw gap is the same (i.e., $\Delta P_1 = \Delta P_2$).
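
The point of the Figure 4 caption can be made explicit with a short derivation. It assumes that the raw gap is the top-token probability minus the </think> probability, which is a reading of the caption rather than a definition confirmed by this excerpt. The symbols $p_{\text{top}}$, $p_{\text{end}}$, and $m$ are introduced here for illustration, and the two cases are read as points on the log curve.

```latex
\Delta P = p_{\text{top}} - p_{\text{end}}, \qquad
m = \log p_{\text{top}} - \log p_{\text{end}}
  = -\log\!\left(1 - \frac{\Delta P}{p_{\text{top}}}\right)

% Same raw gap \Delta P, different top-token probabilities:
m_1 = -\log\!\left(1 - \frac{\Delta P}{0.8}\right), \qquad
m_2 = -\log\!\left(1 - \frac{\Delta P}{0.4}\right)

% Since \Delta P / 0.4 > \Delta P / 0.8 for any \Delta P > 0:
m_2 > m_1
```

Under this reading, the same raw probability gap corresponds to a larger log-probability margin when the top token is less likely, so a log-space margin and a raw-probability gap can rank the two cases differently.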