NesTools: A Dataset for Evaluating Nested Tool Learning Abilities of Large Language Models

Han Han, Tong Zhu, Xiang Zhang, Mengsong Wu, Hao Xiong, Wenliang Chen

TL;DR

NesTools introduces a high-quality, large-scale benchmark for evaluating nested tool learning in large language models. It presents an automatic data-generation pipeline, followed by manual refinement, to create diverse, realistically structured nested tool calls across domains. The paper defines deterministic evaluation metrics for tool selection, order, parameter filling, and nested parameter filling, and demonstrates, across 22 LLMs, that deeper nesting and complex structures remain challenging despite model scaling. The results highlight the current limitations of both open-weight and proprietary models in handling nested tool chains, offering insights to guide future research toward robust, real-world tool agents.

Abstract

Large language models (LLMs) combined with tool learning have achieved impressive results in real-world applications. During tool learning, LLMs may call multiple tools in nested order, where a later tool call may take an earlier call's response as its input parameters. However, nested tool learning capabilities remain under-explored, since existing benchmarks lack relevant data instances. To address this problem, we introduce NesTools to bridge the current gap in comprehensive nested tool learning evaluation. NesTools comprises a novel automatic data generation method to construct large-scale nested tool calls with different nesting structures. With manual review and refinement, the dataset is of high quality and closely aligned with real-world scenarios. Therefore, NesTools can serve as a new benchmark to evaluate the nested tool learning abilities of LLMs. We conduct extensive experiments on 22 LLMs and provide in-depth analyses with NesTools, which show that current LLMs still struggle with complex nested tool learning tasks.
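
To make the idea of a nested call concrete, the following is a minimal sketch, not the paper's data format. The two tools, the `$k.field` reference notation, and the helper names `resolve` and `run_plan` are all assumptions invented for illustration. It shows a second tool call whose parameters are filled from the first call's response.

```python
from typing import Any, Callable, Dict, List

# Hypothetical tools, invented for illustration only.
def get_city_coordinates(city: str) -> Dict[str, float]:
    table = {"Paris": {"lat": 48.86, "lon": 2.35}}
    return table[city]

def get_weather(lat: float, lon: float) -> str:
    return f"sunny at ({lat}, {lon})"

TOOLS: Dict[str, Callable[..., Any]] = {
    "get_city_coordinates": get_city_coordinates,
    "get_weather": get_weather,
}

# Assumed notation: the string "$k.field" refers to `field` in the
# response of the k-th call (1-indexed). This is not the NesTools schema.
PLAN: List[Dict[str, Any]] = [
    {"name": "get_city_coordinates", "parameters": {"city": "Paris"}},
    {"name": "get_weather", "parameters": {"lat": "$1.lat", "lon": "$1.lon"}},
]

def resolve(value: Any, responses: List[Any]) -> Any:
    """Replace a "$k.field" reference with the earlier call's response value."""
    if isinstance(value, str) and value.startswith("$"):
        index, _, field = value[1:].partition(".")
        result = responses[int(index) - 1]
        return result[field] if field else result
    return value

def run_plan(plan: List[Dict[str, Any]], tools: Dict[str, Callable[..., Any]]) -> List[Any]:
    """Execute calls in order, filling nested parameters from earlier responses."""
    responses: List[Any] = []
    for call in plan:
        args = {k: resolve(v, responses) for k, v in call["parameters"].items()}
        responses.append(tools[call["name"]](**args))
    return responses

if __name__ == "__main__":
    print(run_plan(PLAN, TOOLS)[-1])
```

In this sketch the second call cannot be filled correctly unless the first call is selected, ordered, and parameterized correctly, which is the dependency the benchmark is designed to probe.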

Paper Structure

This paper contains 37 sections, 1 equation, 6 figures, 15 tables.

Figures (6)

  • Figure 1: Example of nested tool calling.
  • Figure 2: The dataset construction process of NesTools.
  • Figure 3: Model scaling results on NesTools.
  • Figure 4: The relation between nesting depth and Selection F1 among LLMs.
  • Figure 5: The averaged performance of different nesting structures. The arrow between two numbers indicates the nesting shape. For example, the first structure type (1$\rightarrow$2,1$\rightarrow$3) denotes that the 1st tool call's response contributes the input parameters for both the 2nd and the 3rd tool calls.
  • ...and 1 more figure
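
The structure notation in the Figure 5 caption can be read as a list of directed edges between tool calls. The sketch below is illustrative only: the function names, the ASCII `->` arrow, and the choice to count depth as the number of calls on the longest dependency chain are assumptions, and the paper may define nesting depth differently.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]

def parse_structure(text: str) -> List[Edge]:
    """Parse a structure string such as "1->2,1->3" into (source, target) edges."""
    edges: List[Edge] = []
    for part in text.split(","):
        src, dst = part.split("->")
        edges.append((int(src), int(dst)))
    return edges

def nesting_depth(edges: List[Edge]) -> int:
    """Number of calls on the longest dependency chain.

    Assumes every edge points from an earlier call to a later one
    (source < target), so processing edges by ascending target visits
    each call only after all of its inputs have been resolved.
    """
    depth: Dict[int, int] = {}
    for src, dst in sorted(edges, key=lambda e: e[1]):
        depth[dst] = max(depth.get(dst, 1), depth.get(src, 1) + 1)
    return max(depth.values(), default=1)

if __name__ == "__main__":
    fan_out = parse_structure("1->2,1->3")  # call 1 feeds calls 2 and 3
    chain = parse_structure("1->2,2->3")    # call 1 feeds 2, which feeds 3
    print(nesting_depth(fan_out), nesting_depth(chain))
```

Under these assumptions, the fan-out structure from the caption (1→2, 1→3) is shallower than a three-call chain (1→2, 2→3), even though both involve three tool calls.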