Automated test generation to evaluate tool-augmented LLMs as conversational AI agents

Samuel Arcadinho; David Aparicio; Mariana Almeida

TL;DR

This work tackles the challenge of evaluating tool-augmented LLMs as conversational AI agents by introducing an automated test-generation pipeline grounded in user-defined procedures. It uses intermediate graph representations (flowgraphs and conversation graphs) to ensure procedure-grounded, diverse conversations while curbing hallucinations, and it provides ALMITA, a manually curated customer-support dataset for end-to-end evaluation. Empirical results show strong single-turn performance and API-call accuracy across several models but reveal substantial gaps in maintaining correct, coherent conversations across complete interactions. The framework is general and extensible, enabling fully automated test generation (auto-ALMITA) and applicability to domains beyond customer support, with ALMITA serving as a public benchmark for future research.

Abstract

Tool-augmented LLMs are a promising approach to creating AI agents that can hold realistic conversations, follow procedures, and call appropriate functions. However, evaluating them is challenging because of the diversity of possible conversations, and existing datasets focus only on single interactions and function calling. We present a test generation pipeline to evaluate LLMs as conversational AI agents. Our framework uses LLMs to generate diverse tests grounded in user-defined procedures. To that end, we use intermediate graphs to limit the LLM test generator's tendency to hallucinate content that is not grounded in the input procedures and to enforce high coverage of the possible conversations. Additionally, we put forward ALMITA, a manually curated dataset for evaluating AI agents in customer support, and use it to evaluate existing LLMs. Our results show that while tool-augmented LLMs perform well in single interactions, they often struggle to handle complete conversations. While our focus is on customer support, our method is general and can be applied to AI agents in other domains.

Paper Structure (29 sections, 5 figures, 3 tables, 1 algorithm)

Introduction
Related work
Method
Intent generator
Procedure generator
API extractor
Flowgraph generator
Conversation graph generator
Noise generator
Path sampler
Conversation generator
Test extractor
Results
Dataset generation: ALMITA
Ablation study: conversations from procedures
...and 14 more sections

Figures (5)

Figure 1: Automated test generation pipeline. For a given intent (e.g., cancel order), (1) we use an LLM to generate a corresponding procedure. Then, (2) an LLM extracts the relevant APIs from the procedure and (3) generates a flowgraph from the procedure and its APIs. Next, (4) an LLM generates a conversation graph from the flowgraph and (5) adds noise to the graph (e.g., users deviating from the expected procedure) to make it more realistic. To obtain conversations from the graph, (6) we sample paths from it, which correspond to different interactions. Finally, (7) an LLM generates conversations from the paths and (8) we extract tests from the sampled conversations.
Figure 2: Flowgraph for intent Order not received and procedure "If the customer did not receive their order, allow the customer to cancel or refund their order given that they provide a correct order id". Blue nodes are message nodes, black nodes are API call nodes, orange nodes are end nodes. Edge labels are user messages or API outputs.
Figure 3: Conversation graph for the flowgraph from Fig. 2 for intent Order not received. Blue nodes are agent nodes, green nodes are user nodes, and black nodes are API nodes.
Figure 4: Tests extracted from one conversation.
Figure 5: Value of the test correct metric for different LLM agents on the auto-ALMITA and ALMITA datasets.
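
The captions for Figures 1, 3, and 4 describe turning a conversation graph into paths and then into tests. The sketch below is one possible reading of those two steps, not the authors' implementation. The graph contents, the API names (`validate_order`, `cancel_order`, `refund_order`), and the helper functions are all hypothetical. It also enumerates every path of a small acyclic graph, whereas the paper's pipeline samples paths; the test format (context plus expected next agent or API turn) is an assumption.

```python
from typing import Dict, List, Tuple

Node = Tuple[str, str]  # (role, content); role is "agent", "user", or "api"

# Hypothetical conversation graph for an "order not received" intent.
NODES: Dict[str, Node] = {
    "a0": ("agent", "Please share your order id."),
    "u0": ("user", "My order id is 123."),
    "api0": ("api", "validate_order(order_id=123)"),
    "a1": ("agent", "Would you like to cancel or get a refund?"),
    "u1": ("user", "Cancel it."),
    "u2": ("user", "Refund, please."),
    "api1": ("api", "cancel_order(order_id=123)"),
    "api2": ("api", "refund_order(order_id=123)"),
    "a2": ("agent", "Done. Anything else?"),
    "a3": ("agent", "That order id is not valid."),
}

# Adjacency list; nodes without an entry are end nodes.
EDGES: Dict[str, List[str]] = {
    "a0": ["u0"],
    "u0": ["api0"],
    "api0": ["a1", "a3"],
    "a1": ["u1", "u2"],
    "u1": ["api1"],
    "u2": ["api2"],
    "api1": ["a2"],
    "api2": ["a2"],
}


def enumerate_paths(start: str) -> List[List[str]]:
    """Return every path from `start` to an end node (assumes an acyclic graph)."""
    paths: List[List[str]] = []
    stack: List[List[str]] = [[start]]
    while stack:
        path = stack.pop()
        successors = EDGES.get(path[-1], [])
        if not successors:
            paths.append(path)
            continue
        for nxt in successors:
            stack.append(path + [nxt])
    return paths


def extract_tests(path: List[str]) -> List[dict]:
    """Build one test per agent or API turn that has preceding context.

    Assumed test format: the context is every turn before the target turn,
    and the expected output is the target turn itself.
    """
    turns = [NODES[node_id] for node_id in path]
    tests = []
    for i, (role, content) in enumerate(turns):
        if role in ("agent", "api") and i > 0:
            tests.append({"context": turns[:i], "expected": (role, content)})
    return tests


paths = enumerate_paths("a0")
tests = [t for p in paths for t in extract_tests(p)]
print(len(paths), "paths,", len(tests), "tests")
```

Extracting one test per turn is what lets a single conversation exercise both single-interaction accuracy and, when all tests from the same path are considered together, whole-conversation correctness.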
