Try, Check and Retry: A Divide-and-Conquer Framework for Boosting Long-context Tool-Calling Performance of LLMs

Kunfeng Chen, Qihuang Zhong, Juhua Liu, Bo Du, Dacheng Tao

Abstract

Tool-calling empowers Large Language Models (LLMs) to interact with external environments. However, current methods often struggle to handle large, noisy sets of candidate tools in long-context tool-calling tasks, which limits their real-world application. To this end, we propose Tool-DC, a Divide-and-Conquer framework for boosting the tool-calling performance of LLMs. The core of Tool-DC is to reduce the reasoning difficulty and make full use of the self-reflection ability of LLMs via a "Try-Check-Retry" paradigm. Specifically, Tool-DC has two variants: 1) the training-free Tool-DC (TF), which is plug-and-play and flexible; 2) the training-based Tool-DC (TB), which is more inference-efficient. Extensive experiments show that both Tool-DC methods outperform their counterparts by a clear margin. Tool-DC (TF) brings average gains of up to +25.10% over the baseline on the BFCL and ACEBench benchmarks, while Tool-DC (TB) enables Qwen2.5-7B to achieve performance comparable to or even better than proprietary LLMs, e.g., OpenAI o3 and Claude-Haiku-4.5.

Paper Structure (38 sections, 5 equations, 11 figures, 10 tables, 1 algorithm)

Figures (11)

  • Figure 1: Performance comparison on BFCL (Patil et al.) with different candidate tool scales. As the number of candidate tools increases, the performance of all models degrades significantly, whereas our Tool-DC method effectively mitigates this issue.
  • Figure 2: Overview of our Tool-DC framework. The training-free strategy employs a "Try-Check-Retry" pipeline to reduce the reasoning difficulty, while the training-based strategy leverages the prior reasoning trajectories to internalize this divide-and-conquer paradigm into model parameters via fine-tuning.
  • Figure 3: Illustration of Try stage in Tool-DC (TF). By splitting the total candidate tools into several parallel groups, Tool-DC (TF) can reduce the length of context and reasoning difficulty effectively.
  • Figure 4: Performance comparison of other LLMs with different training-free strategies on the BFCL benchmark. Here, we mainly compare our Tool-DC (TF) method with the All_Funs baseline.
  • Figure 5: Comparison between base models and tuned models using Tool-DC (TB) on Standard ACEBench.
  • ...and 6 more figures
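
The Try stage described in the caption of Figure 3 splits the full set of candidate tools into several groups that can be processed in parallel, so that each prompt carries a shorter context. The following is a minimal sketch of that splitting step only. It assumes fixed-size, order-preserving groups; the function name `split_into_groups`, the `group_size` parameter, and the toy tool records are hypothetical and are not taken from the paper, which may form its groups differently.

```python
from typing import Dict, List


def split_into_groups(tools: List[Dict], group_size: int) -> List[List[Dict]]:
    """Partition candidate tools into consecutive groups of at most group_size.

    Assumption: simple fixed-size chunking. The paper's actual grouping
    strategy is not specified in this section.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return [tools[i:i + group_size] for i in range(0, len(tools), group_size)]


# Hypothetical candidate tools, for illustration only.
candidate_tools = [{"name": f"tool_{i}"} for i in range(10)]

# Each group would be placed in its own, shorter prompt.
groups = split_into_groups(candidate_tools, group_size=4)
group_sizes = [len(g) for g in groups]
```

With ten tools and a group size of four, this yields three groups, the last of which is smaller. The Check and Retry stages are not sketched here because this section does not describe how they operate.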