CallNavi, A Challenge and Empirical Study on LLM Function Calling and Routing
Yewei Song, Xunzhu Tang, Cedric Lothritz, Saad Ezzini, Jacques Klein, Tegawendé F. Bissyandé, Andrey Boytsov, Ulrick Ble, Anne Goujon
TL;DR
CallNavi introduces a challenging benchmark for LLM-driven API function calling, featuring unfiltered API pools, multi-step routing, and nested dependencies across 10 domains. It combines automated generation with manual validation to create 729 questions over 579 APIs, and evaluates 18 models using AST-based outputs, LLM-as-a-Judge scoring, and a novel Stability Score to quantify consistency. A hybrid 2-step approach (a general LLM for API selection, followed by a fine-tuned model for parameter generation), along with backward inference thinking, substantially improves performance, especially on hard questions. The study reveals that even state-of-the-art models like GPT-4o exhibit limitations in long-context reasoning, structured output generation, and hallucination, underscoring the need for hybrid architectures, retrieval augmentation, and broader coverage of real-world constraints such as authentication and versioning. Overall, CallNavi provides a robust framework for evaluating and advancing AI-assisted software engineering tasks involving complex API function calling and routing.
Abstract
API-driven chatbot systems are increasingly integral to software engineering applications, yet their effectiveness hinges on accurately generating and executing API calls. This is particularly challenging in scenarios requiring multi-step interactions with complex parameterization and nested API dependencies. Addressing these challenges, this work contributes to the evaluation and assessment of AI-based software development through three key advancements: (1) the introduction of a novel dataset specifically designed for benchmarking API function selection, parameter generation, and nested API execution; (2) an empirical evaluation of state-of-the-art language models, analyzing their performance across varying task complexities in API function generation and parameter accuracy; and (3) a hybrid approach to API routing, combining general-purpose large language models for API selection with fine-tuned models and prompt engineering for parameter generation. These innovations significantly improve API execution in chatbot systems, offering practical methodologies for enhancing software design, testing, and operational workflows in real-world software engineering contexts.
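The two-step routing idea in contribution (3) can be illustrated with a minimal sketch. It is not the paper's implementation: the function `route`, the pool layout, and the two stub callables are hypothetical names introduced here, and the stubs use keyword and regex matching only as stand-ins for the general-purpose LLM (API selection) and the fine-tuned model (parameter generation). Dropping selected names that are absent from the pool is likewise an assumption of this sketch, added to show one simple guard against hallucinated API names.

```python
import re
from typing import Callable, Dict, List

ApiPool = Dict[str, dict]  # hypothetical layout: API name -> spec


def route(
    question: str,
    api_pool: ApiPool,
    select_apis: Callable[[str, ApiPool], List[str]],
    fill_parameters: Callable[[str, str, dict], dict],
) -> List[dict]:
    """Two-step routing: select APIs first, then generate parameters per API."""
    # Step 1: a general-purpose model proposes API names from the full pool.
    candidates = select_apis(question, api_pool)
    # Assumption of this sketch: discard names that do not exist in the pool.
    selected = [name for name in candidates if name in api_pool]
    # Step 2: a second (e.g. fine-tuned) model fills parameters for each API.
    return [
        {"api": name, "parameters": fill_parameters(question, name, api_pool[name])}
        for name in selected
    ]


# Hypothetical two-entry API pool, for illustration only.
API_POOL: ApiPool = {
    "getAccountBalance": {"keywords": ["balance"], "parameters": ["accountId"]},
    "listTransactions": {"keywords": ["transactions", "history"], "parameters": ["accountId"]},
}


def stub_selector(question: str, api_pool: ApiPool) -> List[str]:
    """Stand-in for the selection model: naive keyword matching."""
    text = question.lower()
    picks = [
        name
        for name, spec in api_pool.items()
        if any(keyword in text for keyword in spec["keywords"])
    ]
    # Append an invented name to show that route() filters it out.
    return picks + ["nonexistentApi"]


def stub_filler(question: str, api_name: str, spec: dict) -> dict:
    """Stand-in for the parameter model: use the first number in the question."""
    match = re.search(r"\d+", question)
    value = match.group(0) if match else None
    return {param: value for param in spec["parameters"]}


calls = route("What is the balance of account 42?", API_POOL, stub_selector, stub_filler)
print(calls)
```

Passing the two models as callables keeps the steps independent, so either one could be swapped (for example, a different selector over the same parameter model) without changing the routing logic.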
