ThinkBrake: Mitigating Overthinking in Tool Reasoning
Minjae Oh, Sangjun Song, Seungkyu Lee, Sungmin Jo, Yohan Jo
TL;DR
The paper addresses overthinking in small reasoning models (SRMs) during tool use: SRMs often reach the correct tool-argument configuration but keep reasoning and overwrite it with an incorrect call. Using causal oracle rollouts that insert </think> at sentence boundaries, the authors show that this loss is recoverable: on BFCL, well-timed termination raises average accuracy from 85.8% to 94.2% while cutting tokens by 80-94%. They introduce ThinkBrake, a training-free decoding heuristic that ends thinking at a sentence boundary when the log-probability margin between </think> and the current top token becomes small. Across BFCL splits and model scales, ThinkBrake preserves or improves accuracy while reducing token usage by up to 25%.
Abstract
Small reasoning models (SRMs) often overthink during tool use: they reach a correct tool-argument configuration, then continue reasoning and overwrite it with an incorrect final call. We diagnose overthinking via oracle rollouts that inject </think> at sentence boundaries. On the Berkeley Function Calling Leaderboard (BFCL), this oracle termination lifts average accuracy from 85.8% to 94.2% while reducing tokens by 80-94%, revealing substantial recoverable headroom and potentially redundant reasoning. While prior work on concise reasoning has largely targeted mathematics, tool reasoning remains underexplored. We adapt various early-termination baselines to tool use and introduce ThinkBrake, a training-free decoding heuristic. ThinkBrake monitors the log-probability margin between </think> and the current top token at sentence boundaries and triggers termination when this margin becomes small. Across BFCL's single-turn non-live and live splits, ThinkBrake preserves or improves accuracy while reducing tokens by up to 25%, outperforming various baselines.
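The termination rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the punctuation-based sentence-boundary check, and the threshold value of 1.0 are all assumptions made for illustration, and the abstract does not specify how boundaries are detected or how the threshold is chosen. The sketch only shows the general idea of comparing the log-probability of </think> with that of the current top token at a sentence boundary.

```python
import math

END_THINK = "</think>"  # end-of-thinking marker named in the abstract
SENTENCE_ENDINGS = (".", "!", "?", "\n")  # assumed boundary heuristic, not from the paper


def log_softmax(logits):
    """Convert a {token: logit} mapping into {token: log-probability}."""
    peak = max(logits.values())
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
    return {tok: v - log_z for tok, v in logits.items()}


def at_sentence_boundary(generated_text):
    """Assumed heuristic: the text so far ends with sentence punctuation or a newline."""
    return generated_text.rstrip(" ").endswith(SENTENCE_ENDINGS)


def think_margin(logprobs):
    """Log-probability gap between the current top token and </think> (never negative)."""
    return max(logprobs.values()) - logprobs[END_THINK]


def should_brake(generated_text, logits, threshold=1.0):
    """Return True when this sketch would stop thinking and emit </think>.

    `threshold` is a hypothetical hyperparameter; 1.0 is an arbitrary default.
    """
    if not at_sentence_boundary(generated_text):
        return False  # only check the margin at sentence boundaries
    return think_margin(log_softmax(logits)) <= threshold
```

In a decoding loop, a check like `should_brake` would run on the next-token distribution after each generated sentence; when it returns True, the decoder would append </think> and proceed to the final tool call. A smaller threshold makes termination more conservative, since </think> must be nearly as likely as the top token before thinking stops.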
