Learning API Functionality from In-Context Demonstrations for Tool-based Agents

Bhrij Patel, Ashish Jagmohan, Aditya Vempaty

TL;DR

The paper formalizes the problem of learning API functionality from in-context demonstrations when reliable API documentation is unavailable. It models task completion as a goal-conditioned POMDP and introduces methods that derive textualizations from expert demonstrations and from self-exploration experiences. It evaluates multiple demonstration-processing and experience-processing strategies (DxD, GD, GDEC, OD, DE, UD, RD, AG) across three benchmarks, showing that parameter information and correct parameter filling are the crucial bottlenecks. Key findings: explicit function calls and natural-language critiques improve task success, while hallucinations and mis-specified parameters significantly degrade performance. Attaching guidelines can yield meaningful improvements, and simple fixes such as prepending '#' to order_id can dramatically boost success rates. The work highlights significant challenges for state-of-the-art LLMs in documentation-free API understanding and points to future directions: better parameter-schema handling, filtering of suboptimal demonstrations, and group learning across APIs, all aimed at reliable, self-improving tool-based agents.

Abstract

Digital tool-based agents, powered by Large Language Models (LLMs), that invoke external Application Programming Interfaces (APIs) often rely on documentation to understand API functionality. However, such documentation is frequently missing, outdated, privatized, or inconsistent, hindering the development of reliable, general-purpose agents. In this work, we propose a new research direction: learning API functionality directly from in-context demonstrations. This task is a new paradigm applicable in scenarios without documentation. Using API benchmarks, we collect demonstrations from both expert agents and self-exploration. To understand what information demonstrations must convey for successful task completion, we extensively study how the number of demonstrations and the use of LLM-generated summaries and evaluations affect the task success rate of the API-based agent. Our experiments across 3 datasets and 6 models show that learning functionality from in-context demonstrations remains a non-trivial challenge, even for state-of-the-art LLMs. We find that providing explicit function calls and natural language critiques significantly improves the agent's task success rate due to more accurate parameter filling. We analyze failure modes, identify sources of error, and highlight key open challenges for future work in documentation-free, self-improving, API-based agents.

Paper Structure

This paper contains 18 sections, 2 equations, 14 figures, 7 tables, 1 algorithm.

Figures (14)

  • Figure 1: Expert demonstration of the email.reply_email function extracted from the WorkBench dataset (Styles et al.). Demonstrations are the basis of how agents understand functionality without prior documentation.
  • Figure 2: Processing Methods of Expert Demonstrations: Given a set of training demonstrations for each function $f$, we sample $N$ demonstrations for each function to be used for either (TOP) LLM-based documentation generation or (BOTTOM) to be directly passed into the agent.
  • Figure 3: Experience-based demonstration gained during self-exploration by $\pi$ for the same task as in Figure 1. This experience includes the thought process and the return value of the reply_email function. Note that this is an incorrect use of reply_email. The "Evaluation of Demonstration" is shown in Figure 4.
  • Figure 4: LLM-generated evaluation of a self-exploration trajectory by $\pi$ for the same task as in Figures 1 and 3. The agent incorrectly tries to reply to the email without first searching for the email ID. It then corrects itself by executing search_email and then correctly using reply_email again. The first reply_email call is formatted into the demonstration shown in Figure 3. The evaluation of the first reply_email call emphasizes that the agent should have first confirmed it had the right email ID. The evaluation of the second reply_email call states that it was used correctly after the agent found the email ID with search_email. Each of these evaluations is added to the demonstration of its respective call.
  • Figure 5: Summarized guidelines from experiences and evaluations of reply_email. The lesson emphasizes using search_email beforehand to find the right email_id to use for reply_email, which the agent sometimes did not do, as shown in Figure 4.
  • ...and 9 more figures
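
The captions for Figures 2 to 4 describe a pipeline: sample $N$ demonstrations per function, optionally attach a natural-language evaluation to each call, and pass the result to the agent as in-context text. A minimal Python sketch of that idea follows. It is an illustration under stated assumptions, not the authors' implementation: the `Demonstration` fields, the helper names, the `body` and `query` parameters, the `email.search_email` namespacing, and all example values are hypothetical. Only the names reply_email, search_email, and email_id appear in the captions.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Demonstration:
    """One recorded function call. All field names are hypothetical."""
    task: str
    function: str
    arguments: dict
    evaluation: Optional[str] = None  # natural-language critique, if any


def sample_demonstrations(pool, n, seed=0):
    """Group the pool by function and sample up to n demonstrations each."""
    rng = random.Random(seed)
    by_function = {}
    for demo in pool:
        by_function.setdefault(demo.function, []).append(demo)
    return {
        name: rng.sample(demos, min(n, len(demos)))
        for name, demos in by_function.items()
    }


def render_call(demo):
    """Write the call out explicitly, e.g. f(a='x', b='y')."""
    args = ", ".join(f"{k}={v!r}" for k, v in demo.arguments.items())
    return f"{demo.function}({args})"


def render_context(sampled):
    """Turn sampled demonstrations into plain text for the agent's prompt."""
    lines = []
    for name in sorted(sampled):
        lines.append(f"Function: {name}")
        for demo in sampled[name]:
            lines.append(f"  Task: {demo.task}")
            lines.append(f"  Call: {render_call(demo)}")
            if demo.evaluation:
                lines.append(f"  Evaluation: {demo.evaluation}")
    return "\n".join(lines)


# Hypothetical training pool; tasks, IDs, and critiques are made up.
pool = [
    Demonstration("Reply to Ana with thanks", "email.reply_email",
                  {"email_id": "0001", "body": "Thanks"}),
    Demonstration("Reply to Bo confirming", "email.reply_email",
                  {"email_id": "0002", "body": "Confirmed"}),
    Demonstration("Reply to Cy declining", "email.reply_email",
                  {"email_id": "0003", "body": "Sorry, no"}),
    Demonstration("Find the latest email from Ana", "email.search_email",
                  {"query": "Ana"},
                  evaluation="Correct: located the email ID before replying."),
]

sampled = sample_demonstrations(pool, n=2)
context = render_context(sampled)
print(context)
```

Writing each call out with its arguments mirrors the finding that explicit function calls help parameter filling. The optional `evaluation` field is where a critique such as the one in Figure 4 would be attached to its respective call.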