
WTU-EVAL: A Whether-or-Not Tool Usage Evaluation Benchmark for Large Language Models

Kangyun Ning, Yisong Su, Xueqiang Lv, Yuanzhe Zhang, Jian Liu, Kang Liu, Jinan Xu

TL;DR

WTU-Eval introduces the first benchmark to evaluate whether LLMs can decide if and when to use external tools, addressing real-world uncertainty in tool usage. Its 11 datasets (6 tool-usage, 5 general) test the tool-choice behavior of eight LLMs under zero-shot and few-shot ReAct-style prompting with four tools, scored by accuracy and tool-call metrics. Results show that many models struggle to identify when a tool is needed and that incorrect tool usage consistently harms performance, though appropriate tool use helps on tool-usage datasets when a model's ability is comparable to ChatGPT's. Supervised fine-tuning on a 4,000-example dataset improves tool-use decision-making (for Llama2-7B, +14% average performance and −16.8% incorrect tool usage), reducing unnecessary tool invocations and improving efficiency. The benchmark and findings highlight the need to train LLMs not only to use tools but also to decide when tool usage is warranted, with practical implications for deploying robust, efficient AI systems.

Abstract

Although Large Language Models (LLMs) excel in NLP tasks, they still need external tools to extend their abilities. Current research on tool learning with LLMs often assumes mandatory tool use, which does not always match real-world situations, where the necessity for tools is uncertain and incorrect or unnecessary tool use can damage the general abilities of LLMs. Therefore, we propose to explore whether LLMs can discern their ability boundaries and use tools flexibly. We then introduce the Whether-or-not tool usage Evaluation benchmark (WTU-Eval) to assess LLMs with eleven datasets, of which six are tool-usage datasets and five are general datasets. LLMs are prompted to use tools according to their needs. The results of eight LLMs on WTU-Eval reveal that LLMs frequently struggle to determine tool use in general datasets, and that LLMs' performance in tool-usage datasets improves when their ability is similar to ChatGPT's. In both types of datasets, incorrect tool usage significantly impairs LLMs' performance. To mitigate this, we also develop a fine-tuning dataset to enhance tool decision-making. Fine-tuning Llama2-7B results in a 14% average performance improvement and a 16.8% decrease in incorrect tool usage. We will release the WTU-Eval benchmark.
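
The two headline quantities in the abstract, answer accuracy and the rate of incorrect tool usage, can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the `Record` schema, the `score` function, and the rule that a tool decision is "wrong" whenever `called_tool` differs from `needs_tool` are assumptions made here for illustration, and the paper's own definition of incorrect tool usage may be broader (for example, covering a wrong tool choice or a malformed call).

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One evaluated example (hypothetical schema, not taken from the paper)."""
    needs_tool: bool   # True for a tool-usage dataset, False for a general dataset
    called_tool: bool  # whether the model chose to invoke any tool
    correct: bool      # whether the final answer matched the reference


def score(records):
    """Return accuracy and the share of examples with a mismatched tool decision."""
    if not records:
        raise ValueError("records must be non-empty")
    n = len(records)
    accuracy = sum(r.correct for r in records) / n
    # Assumption: a decision counts as wrong when the model calls a tool on an
    # example that does not need one, or skips the tool on one that does.
    wrong_decisions = sum(r.called_tool != r.needs_tool for r in records)
    return {"accuracy": accuracy, "wrong_decision_rate": wrong_decisions / n}


records = [
    Record(needs_tool=True, called_tool=True, correct=True),
    Record(needs_tool=True, called_tool=False, correct=False),
    Record(needs_tool=False, called_tool=True, correct=False),
    Record(needs_tool=False, called_tool=False, correct=True),
]
result = score(records)
print(result)
```

Tracking the decision separately from answer correctness makes it possible to see when a model reaches the right answer despite an unnecessary tool call, or fails because it skipped a tool it needed.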

Paper Structure (28 sections, 8 figures, 9 tables)

This paper contains 28 sections, 8 figures, 9 tables.

Figures (8)

  • Figure 1: An example showing the failure of calling tools inappropriately.
  • Figure 2: Illustrative diagram depicting user interaction scenarios with and without access to tool pools. LLMs need to respond to the user's query in Region1 (R1) and Region3 (R3). In Region2 (R2) and Region4 (R4), LLMs must judge based on the nature of the task whether a tool is required. If so, the corresponding tool from the tool pool is invoked; if not, the answer is provided using its knowledge. If the judgment is correct, then the corresponding choice is highlighted in green; otherwise, it is in red.
  • Figure 3: Illustrative diagram depicting user interaction scenarios with LLMs in COT setting without the integration of a tool set.
  • Figure 4: Distribution of error types in tool-usage and general datasets in the zero-shot setting for Llama2-7B.
  • Figure :
  • ...and 3 more figures