Beyond Static Evaluation: A Dynamic Approach to Assessing AI Assistants' API Invocation Capabilities

Honglin Mu; Yang Xu; Yunlong Feng; Xiaofeng Han; Yitong Li; Yutai Hou; Wanxiang Che

TL;DR

This work proposes Automated Dynamic Evaluation (AutoDE), which assesses an assistant's API call capability without human involvement. To closely mirror genuine human conversation patterns in human-machine interaction, it uses an LLM-based user agent equipped with a user script to ensure human alignment.

Abstract

With the rise of Large Language Models (LLMs), AI assistants' ability to use tools, especially through API calls, has advanced notably. This progress calls for more accurate evaluation methods. Many existing studies adopt static evaluation, in which an AI assistant's API calls are assessed against pre-defined dialogue histories. However, such an evaluation method can be misleading, because an AI assistant might fail to generate API calls from the preceding human interaction in real cases. Instead of the resource-intensive method of direct human-machine interaction, we propose Automated Dynamic Evaluation (AutoDE) to assess an assistant's API call capability without human involvement. In our framework, we endeavor to closely mirror genuine human conversation patterns in human-machine interactions, using an LLM-based user agent equipped with a user script to ensure human alignment. Experimental results show that AutoDE uncovers errors overlooked by static evaluation and aligns more closely with human assessment. When testing four AI assistants on our crafted benchmark, our method mirrored human evaluation more closely than conventional static evaluation.
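The interaction loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `UserScript` fields follow the Figure 1 description (a background plus an API call label), while `run_dynamic_episode`, `scripted_user`, `toy_assistant`, and the `openApp` API name are hypothetical stand-ins chosen for illustration. A real user agent would be an LLM prompted with the script rather than a rule-based function.

```python
from dataclasses import dataclass


@dataclass
class UserScript:
    background: str  # dialogue context the simulated user follows
    api_label: dict  # expected API call the assistant should produce


def run_dynamic_episode(script, user_agent, assistant, max_turns=5):
    """Let a user agent and an assistant converse; score the first API call."""
    history = []  # list of (role, text) pairs
    for _ in range(max_turns):
        history.append(("user", user_agent(script, history)))
        reply, api_call = assistant(history)
        history.append(("assistant", reply))
        if api_call is not None:
            # Exact match against the label is an assumption of this sketch.
            return api_call == script.api_label
    return False  # the assistant never issued a call


def scripted_user(script, history):
    """Hypothetical stand-in for an LLM user agent: reveal one fact per turn."""
    facts = [f.strip() for f in script.background.split(";")]
    turn = len(history) // 2
    return facts[min(turn, len(facts) - 1)]


def toy_assistant(history):
    """Hypothetical assistant: calls the API once it knows the app name."""
    user_text = " ".join(text for role, text in history if role == "user")
    if "open" in user_text and "Clock" in user_text:
        return "Opening it now.", {"name": "openApp", "params": {"appName": "Clock"}}
    return "Which app do you mean?", None


script = UserScript(
    background="I want to open an app; the Clock app",
    api_label={"name": "openApp", "params": {"appName": "Clock"}},
)
result = run_dynamic_episode(script, scripted_user, toy_assistant)
```

Because the simulated user reveals information over several turns, the assistant has to ask a follow-up question before it can call the API. A static evaluation, which starts from a complete pre-written history, never exercises that step.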

Paper Structure (41 sections, 6 equations, 3 figures, 2 tables)

Introduction
Preliminary
Method
Evaluation Framework
Manual Evaluation
Evaluation Procedure
Human Annotators
Static Evaluation
Automated Dynamic Evaluation
Dataset Construction
API Document Construction
User Script Generation
Static Dialogue History Generation
Experimental Setup
User Agent Model
...and 26 more sections

Figures (3)

Figure 1: An illustration of our framework, where the user script encompasses both the dialogue context (Background) and the API call label.
Figure 2: An illustrative example for static evaluation, human evaluation and AutoDE, respectively. Sub-figure A shows the AI assistant correctly invoking an API call from a pre-defined dialogue history. In sub-figure B, the same assistant misses the "appName" parameter during human interaction, resulting in an incorrect API call. Sub-figure C demonstrates similar parameter issues when the assistant interacts with the user agent. We demonstrate that certain API call issues related to interaction, concealed by static evaluation, can be revealed by dynamic human evaluation and AutoDE.
Figure 3: Consistency between human evaluation results (F1 score) and those from various automated evaluation methods on four AI assistants.
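The captions above mention a missing "appName" parameter and an F1 score, but this page does not say how the score is computed. As a rough sketch under that caveat, one plausible choice is to treat an API call as a set of items (its name plus each parameter) and compute F1 over the overlap with the label. The helpers `call_items` and `call_f1`, and the `openApp` API name, are hypothetical and not taken from the paper.

```python
def call_items(call):
    """Flatten an API call into a set of comparable (key, value) items."""
    if call is None:
        return set()
    items = {("name", call["name"])}
    items |= {(key, str(value)) for key, value in call.get("params", {}).items()}
    return items


def call_f1(predicted, gold):
    """F1 over name/parameter items; an assumed metric, not the paper's definition."""
    pred_items, gold_items = call_items(predicted), call_items(gold)
    true_positives = len(pred_items & gold_items)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_items)
    recall = true_positives / len(gold_items)
    return 2 * precision * recall / (precision + recall)


gold = {"name": "openApp", "params": {"appName": "Clock"}}
missing_param = {"name": "openApp", "params": {}}  # the kind of error shown in Figure 2

score_full = call_f1(gold, gold)
score_missing = call_f1(missing_param, gold)
```

Under this assumed metric, a call that drops the only parameter keeps full precision but loses half its recall, so it scores below a complete call without being counted as entirely wrong.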
