Learning API Functionality from In-Context Demonstrations for Tool-based Agents
Bhrij Patel, Ashish Jagmohan, Aditya Vempaty
TL;DR
The paper formalizes the problem of learning API functionality from in-context demonstrations when reliable API documentation is unavailable. It models task completion as a goal-conditioned POMDP and introduces methods for deriving textualizations from expert demonstrations and from self-exploration experiences. It evaluates multiple demonstration-processing and experience-processing strategies (DxD, GD, GDEC, OD, DE, UD, RD, AG) across three benchmarks and shows that missing parameter information and incorrect parameter filling are the main bottlenecks. Explicit function calls and natural-language critiques improve task success, while hallucinations and mis-specified parameters significantly degrade performance. Attaching guidelines can yield meaningful improvements, and simple fixes such as prepending '#' to order_id can dramatically boost success rates. The work highlights significant challenges for state-of-the-art LLMs in documentation-free API understanding and points to future directions: better parameter-schema handling, filtering of suboptimal demonstrations, and group learning across APIs, with the aim of reliable, self-improving tool-based agents.
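The order_id fix mentioned above amounts to a small normalization step applied to a parameter before a call is issued. The following is a minimal sketch of that idea, not the paper's implementation: the helper names `normalize_order_id` and `fix_call`, the dict layout of a call, and the function name `get_order_details` are all hypothetical and chosen only for illustration.

```python
def normalize_order_id(order_id: str) -> str:
    """Return order_id with a leading '#' (hypothetical helper, not from the paper)."""
    stripped = order_id.strip()
    return stripped if stripped.startswith("#") else "#" + stripped


def fix_call(call: dict) -> dict:
    """Return a copy of a call whose order_id argument, if present, is normalized.

    Assumes a call is a dict of the form {"name": ..., "arguments": {...}};
    this layout is an assumption made for the example.
    """
    args = dict(call.get("arguments", {}))
    if "order_id" in args:
        args["order_id"] = normalize_order_id(str(args["order_id"]))
    return {**call, "arguments": args}


# Hypothetical call produced by an agent that omitted the '#' prefix.
raw_call = {"name": "get_order_details", "arguments": {"order_id": "W123"}}
fixed_call = fix_call(raw_call)
```

The original call is left unmodified, so the raw and repaired versions can be compared when analyzing parameter-filling errors.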
Abstract
Digital tool-based agents, powered by Large Language Models (LLMs), that invoke external Application Programming Interfaces (APIs) often rely on documentation to understand API functionality. However, such documentation is frequently missing, outdated, privatized, or inconsistent, which hinders the development of reliable, general-purpose agents. In this work, we propose a new research direction: learning API functionality directly from in-context demonstrations. This task is a new paradigm applicable in scenarios without documentation. Using API benchmarks, we collect demonstrations from both expert agents and self-exploration. To understand what information demonstrations must convey for successful task completion, we extensively study how the number of demonstrations and the use of LLM-generated summaries and evaluations affect the task success rate of the API-based agent. Our experiments across 3 datasets and 6 models show that learning functionality from in-context demonstrations remains a non-trivial challenge, even for state-of-the-art LLMs. We find that providing explicit function calls and natural language critiques significantly improves the agent's task success rate due to more accurate parameter filling. We analyze failure modes, identify sources of error, and highlight key open challenges for future work in documentation-free, self-improving, API-based agents.
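To make the idea of demonstrations with explicit function calls and natural-language critiques concrete, the sketch below renders a demonstration into text and assembles an in-context prompt. It is an illustration under stated assumptions, not the paper's pipeline: the `Step` record, the functions `render_demo` and `build_prompt`, the prompt layout, and the example function name `cancel_order` are all hypothetical.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One step of a demonstration (hypothetical record layout)."""
    function: str                                   # name of the API function called
    arguments: dict = field(default_factory=dict)   # parameters passed to the call
    observation: str = ""                           # text returned by the environment


def render_demo(goal: str, steps: List[Step]) -> str:
    """Textualize a demonstration, spelling out each function call explicitly."""
    lines = [f"Goal: {goal}"]
    for i, step in enumerate(steps, start=1):
        args = json.dumps(step.arguments, sort_keys=True)
        lines.append(f"Step {i}: {step.function}({args})")
        lines.append(f"  -> {step.observation}")
    return "\n".join(lines)


def build_prompt(demos: List[str], task: str, critique: str = "") -> str:
    """Join rendered demonstrations, an optional critique, and the new task."""
    parts = ["Demonstrations:"] + list(demos)
    if critique:
        parts.append(f"Critique: {critique}")
    parts.append(f"Task: {task}")
    return "\n\n".join(parts)


# Hypothetical usage with a single one-step demonstration.
demo = render_demo(
    "Cancel the customer's order",
    [Step("cancel_order", {"order_id": "#W123"}, "order cancelled")],
)
prompt = build_prompt(
    [demo],
    task="Cancel order W456",
    critique="order_id values must start with '#'.",
)
```

Keeping the parameter names and values verbatim in the rendered call is the design choice that matters here, since the abstract attributes the gains to more accurate parameter filling.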
