T1: A Tool-Oriented Conversational Dataset for Multi-Turn Agentic Planning

Amartya Chakraborty, Paresh Dashore, Nadia Bathaee, Anmol Jain, Anirban Das, Shi-Xiong Zhang, Sambit Sahu, Milind Naphade, Genta Indra Winata

TL;DR

The paper tackles the difficulty of planning in multi-turn dialog scenarios that require coordinated use of multiple tools across diverse domains. It introduces T1, a tool-augmented, multi-domain dataset with an accompanying evaluation framework and T1-Agent, designed to simulate and benchmark inter-tool dependencies, memory caching, and dynamic replanning. The dataset spans nine domains, employs 14 tools, and includes 1,500 fully generated dialogues, enabling rigorous evaluation of planning and tool-use capabilities for both open-weight and proprietary LLMs. Experimental results show that domain adaptation via supervised fine-tuning substantially improves performance for smaller models, while larger models excel in certain tasks; overall, T1 serves as a diagnostic benchmark for advancing tool-augmented language agents and planning under realistic constraints.

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities as intelligent agents capable of solving complex problems. However, effective planning in scenarios involving dependencies between API or tool calls, particularly in multi-turn conversations, remains a significant challenge. To address this, we introduce T1, a tool-augmented, multi-domain, multi-turn conversational dataset specifically designed to capture and manage inter-tool dependencies across diverse domains. T1 enables rigorous evaluation of agents' ability to coordinate tool use across nine distinct domains (4 single-domain and 5 multi-domain) with the help of an integrated caching mechanism for both short- and long-term memory, while supporting dynamic replanning, such as deciding whether to recompute or reuse cached results. Beyond facilitating research on tool use and planning, T1 also serves as a benchmark for evaluating the performance of open-weight and proprietary large language models. We present results powered by T1-Agent, highlighting these models' ability to plan and reason in complex, tool-dependent scenarios.
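To make the "recompute or reuse cached results" decision concrete, the following is a minimal sketch, not the T1-Agent implementation. The `ToolCache` class, its `call` method, and the `search_flights` stand-in tool are all hypothetical names introduced for illustration. The sketch assumes that a cached result can be reused whenever the tool name and all arguments match exactly, and that arguments are hashable.

```python
from typing import Any, Callable, Dict, Tuple

CacheKey = Tuple[str, Tuple[Tuple[str, Any], ...]]


class ToolCache:
    """Hypothetical cache keyed by tool name and arguments (illustration only)."""

    def __init__(self) -> None:
        self._store: Dict[CacheKey, Any] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(tool: str, params: Dict[str, Any]) -> CacheKey:
        # Sort arguments so that keyword order does not change the key.
        return tool, tuple(sorted(params.items()))

    def call(self, tool: str, fn: Callable[..., Any], **params: Any) -> Any:
        key = self._key(tool, params)
        if key in self._store:
            # Reuse: an identical call was already made earlier in the dialogue.
            self.hits += 1
            return self._store[key]
        # Recompute: no matching entry, so run the tool and remember the result.
        self.misses += 1
        result = fn(**params)
        self._store[key] = result
        return result


def search_flights(origin: str, destination: str) -> list:
    # Hypothetical stand-in for a real tool; returns a fake flight identifier.
    return [f"{origin}-{destination}-001"]


cache = ToolCache()
first = cache.call("search_flights", search_flights, origin="SFO", destination="JFK")
second = cache.call("search_flights", search_flights, origin="SFO", destination="JFK")
third = cache.call("search_flights", search_flights, origin="SFO", destination="BOS")
```

Here the second call is served from the cache, while the third call changes an argument and triggers a fresh tool execution. A realistic agent would also need a policy for invalidating stale entries, which this sketch omits.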

Paper Structure

This paper contains 71 sections, 1 equation, 9 figures, 42 tables.

Figures (9)

  • Figure 1: Illustrative example from the T1 dataset. This example showcases a multi-domain scenario involving both flights and hotels, where the user is planning a trip and attempting to book relevant services. The dialogue is constructed by retrieving entities from a knowledge base, and tool calls are executed using a predefined toolbox, simulating realistic, tool-augmented agent behavior.
  • Figure 2: T1 generates data by populating delexicalized entities with corresponding entries from the knowledge base.
  • Figure 3: Tool Call F1 (left) and Parameter Matching F1 (right) on the large dataset.
  • Figure 4: Few-shot performance on the flight domain using Llama 3.3 70B Instruct on the large dataset.
  • Figure 5: Tool Calling F1 scores of open-weight and proprietary models on the small dataset.
  • ...and 4 more figures