MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models
Pei Wang, Yanan Wu, Zekun Wang, Jiaheng Liu, Xiaoshuai Song, Zhongyuan Peng, Ken Deng, Chenchen Zhang, Jiakai Wang, Junran Peng, Ge Zhang, Hangyu Guo, Zhaoxiang Zhang, Wenbo Su, Bo Zheng
TL;DR
MTU-Bench introduces MTU-Instruct for large-scale instruction tuning and MTU-Eval for automatic, fine-grained evaluation of tool use in LLMs across multi-turn, multi-tool, and out-of-distribution scenarios. By synthesizing real-world task-oriented dialogues and providing a rich tool documentation framework, MTU-Bench yields 54k dialogues over 136 tools and a diverse set of evaluation metrics that do not rely on GPT or human judgments. The MTU-Eval framework, including normal/hard and OOD splits, enables comprehensive assessment of tool planning, tool selection, parameter accuracy, and the dynamics of tool use across dialogue turns. Fine-tuning on MTU-Instruct yields MTU-LLaMA, which demonstrates strong generalization, robustness to turn and tool count, and performance competitive with strong models such as GPT-4 on several benchmarks, suggesting that MTU-Bench is a valuable resource for advancing real-world tool use in LLMs.
Abstract
Large Language Models (LLMs) have shown substantial improvements in reasoning and decision-making skills and can hold natural conversations with users. Recently, many tool-use benchmark datasets have been proposed. However, existing datasets have the following limitations: (1) insufficient evaluation scenarios (e.g., they cover only limited tool-use scenes), and (2) extensive evaluation costs (e.g., GPT API costs). To address these limitations, in this work, we propose a multi-granularity tool-use benchmark for large language models called MTU-Bench. For the "multi-granularity" property, our MTU-Bench covers five tool usage scenes (i.e., single-turn and single-tool, single-turn and multiple-tool, multiple-turn and single-tool, multiple-turn and multiple-tool, and out-of-distribution tasks). Besides, all evaluation metrics of our MTU-Bench are computed from the prediction results and the ground truth, without any GPT-based or human evaluation. Moreover, our MTU-Bench is collected by transforming existing high-quality datasets to simulate real-world tool usage scenarios, and we also propose an instruction dataset called MTU-Instruct to enhance the tool-use abilities of existing LLMs. Comprehensive experimental results demonstrate the effectiveness of our MTU-Bench. Code and data will be released at https://github.com/MTU-Bench-Team/MTU-Bench.git.
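
To illustrate what a metric "based on the prediction results and the ground truth" might look like, the following is a minimal sketch, not the MTU-Eval implementation. The tool-call shape (`name`, `arguments`), the function names, and the two metric names are all assumptions made for this example; the benchmark's actual data format and metric definitions may differ.

```python
from typing import Any

# Hypothetical tool-call shape for illustration: {"name": str, "arguments": dict}
ToolCall = dict[str, Any]


def tool_selected_correctly(pred: ToolCall, gold: ToolCall) -> bool:
    """True when the predicted tool name equals the ground-truth tool name."""
    return pred.get("name") == gold.get("name")


def parameters_correct(pred: ToolCall, gold: ToolCall) -> bool:
    """True when the tool name and every argument match the ground truth exactly."""
    return tool_selected_correctly(pred, gold) and (
        pred.get("arguments", {}) == gold.get("arguments", {})
    )


def score_calls(preds: list[ToolCall], golds: list[ToolCall]) -> dict[str, float]:
    """Compare predictions with ground truth pairwise; no LLM or human judge is involved."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    n = len(golds)
    if n == 0:
        return {"tool_selection_accuracy": 0.0, "parameter_accuracy": 0.0}
    selected = sum(tool_selected_correctly(p, g) for p, g in zip(preds, golds))
    exact = sum(parameters_correct(p, g) for p, g in zip(preds, golds))
    return {
        "tool_selection_accuracy": selected / n,
        "parameter_accuracy": exact / n,
    }
```

Because scoring of this kind is a deterministic comparison, it can be rerun at no API cost, which is the property the abstract contrasts with GPT-based evaluation.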
