Large Language Models as User-Agents for Evaluating Task-Oriented-Dialogue Systems

Taaha Kazi, Ruiliang Lyu, Sizhe Zhou, Dilek Hakkani-Tur, Gokhan Tur

TL;DR

LLM-based user-agents for evaluating task-oriented dialogue (TOD) systems score better on diversity and task completion metrics when given better prompts. The paper also proposes methodologies for the automatic evaluation of TOD models within this dynamic framework.

Abstract

Traditionally, offline datasets have been used to evaluate task-oriented dialogue (TOD) models. These datasets lack context awareness, making them suboptimal benchmarks for conversational systems. In contrast, user-agents, which are context-aware, can simulate the variability and unpredictability of human conversations, making them better alternatives as evaluators. Prior research has utilized large language models (LLMs) to develop user-agents. Our work builds upon this by using LLMs to create user-agents for the evaluation of TOD systems. This involves prompting an LLM, using in-context examples as guidance, and tracking the user-goal state. Our evaluation of diversity and task completion metrics for the user-agents shows improved performance with the use of better prompts. Additionally, we propose methodologies for the automatic evaluation of TOD models within this dynamic framework.
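The three ingredients named in the abstract (prompting an LLM, in-context examples, and user-goal state tracking) could fit together roughly as sketched below. This is a minimal illustration and not the authors' implementation: the `UserGoalState` class, the `build_prompt` function, the slot names, and the prompt wording are all hypothetical, and the LLM call itself is left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class UserGoalState:
    """Hypothetical tracker of what the simulated user still has to say or obtain."""

    inform: dict[str, str]   # constraints the user must tell the system
    request: list[str]       # slots the user must obtain from the system
    informed: set[str] = field(default_factory=set)
    received: dict[str, str] = field(default_factory=dict)

    def mark_informed(self, slot: str) -> None:
        # Record that the user has already conveyed this constraint.
        if slot in self.inform:
            self.informed.add(slot)

    def mark_received(self, slot: str, value: str) -> None:
        # Record a value the system provided for a requested slot.
        if slot in self.request:
            self.received[slot] = value

    def remaining(self) -> dict[str, list[str]]:
        return {
            "inform": [s for s in self.inform if s not in self.informed],
            "request": [s for s in self.request if s not in self.received],
        }

    def is_complete(self) -> bool:
        rem = self.remaining()
        return not rem["inform"] and not rem["request"]


def build_prompt(state: UserGoalState, examples: list[str], history: list[str]) -> str:
    """Assemble an illustrative user-agent prompt from goal state, examples, and history."""
    rem = state.remaining()
    to_tell = ", ".join(f"{s}={state.inform[s]}" for s in rem["inform"]) or "nothing"
    to_ask = ", ".join(rem["request"]) or "nothing"
    lines = ["You are a user talking to a task-oriented dialogue system."]
    lines.append("Example dialogues:")
    lines.extend(f"- {ex}" for ex in examples)
    lines.append(f"Still to tell the system: {to_tell}")
    lines.append(f"Still to ask for: {to_ask}")
    lines.append("Conversation so far:")
    lines.extend(history)
    lines.append("User:")
    return "\n".join(lines)
```

Under these assumptions, the prompt is rebuilt after every system turn, so the user-agent only sees goals that are still open, and `is_complete()` gives a simple stopping signal for the conversation.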

Paper Structure

This paper contains 22 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: User Simulator Framework
  • Figure 2: An Example of Conversations Generated