ClarQ-LLM: A Benchmark for Models Clarifying and Requesting Information in Task-Oriented Dialog

Yujian Gan, Changling Li, Jinxia Xie, Luou Wen, Matthew Purver, Massimo Poesio

Abstract

We introduce ClarQ-LLM, an evaluation framework consisting of bilingual English-Chinese conversation tasks, conversational agents, and evaluation metrics, designed to serve as a strong benchmark for assessing agents' ability to ask clarification questions in task-oriented dialogues. The benchmark includes 31 different task types, each with 10 unique dialogue scenarios between information seeker and provider agents. The scenarios require the seeker to ask questions to resolve uncertainty and gather the information needed to complete tasks. Unlike traditional benchmarks that evaluate agents based on fixed dialogue content, ClarQ-LLM includes a provider conversational agent that replicates the original human provider in the benchmark. This allows both current and future seeker agents to test their ability to complete information-gathering tasks through dialogue by interacting directly with our provider agent. In tests, the LLAMA3.1 405B seeker agent achieved a maximum success rate of only 60.05%, showing that ClarQ-LLM presents a strong challenge for future research.
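
The setup described in the abstract (31 task types with 10 scenarios each, i.e. 310 scenarios, and a seeker that talks to a provider agent until it can complete the task) can be illustrated with a minimal sketch. Everything below is a toy stand-in and an assumption made for illustration: the class names, the `act`/`reply` interface, the keyword-matching provider, and the exact-match success criterion are hypothetical and are not the benchmark's actual agents, prompts, or metrics.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Toy stand-in for one scenario (hypothetical structure)."""
    task: str
    hidden_facts: dict  # information only the provider knows


class ToyProvider:
    """Hypothetical provider: answers when a question mentions a known fact key."""

    def __init__(self, scenario):
        self.facts = scenario.hidden_facts

    def reply(self, question):
        for key, value in self.facts.items():
            if key in question:
                return f"{key}: {value}"
        return "I don't have that information."


class AskingSeeker:
    """Hypothetical seeker that asks about every slot before answering."""

    def __init__(self, slots):
        self.pending = list(slots)
        self.known = {}

    def act(self, last_reply):
        # Record the provider's previous answer, if it carried a fact.
        if last_reply and ": " in last_reply:
            key, value = last_reply.split(": ", 1)
            self.known[key] = value
        if self.pending:
            return ("ask", f"What is the {self.pending.pop(0)}?")
        return ("final", dict(self.known))


class GuessingSeeker:
    """Hypothetical seeker that assumes values instead of asking."""

    def __init__(self, slots):
        self.slots = list(slots)

    def act(self, last_reply):
        return ("final", {slot: "unknown" for slot in self.slots})


def run_dialogue(seeker, provider, scenario, max_turns=10):
    """Alternate seeker and provider turns; succeed only on an exact final answer."""
    reply = None
    for _ in range(max_turns):
        kind, content = seeker.act(reply)
        if kind == "final":
            return content == scenario.hidden_facts
        reply = provider.reply(content)
    return False  # the seeker never committed to a final answer


def success_rate(outcomes):
    """Fraction of dialogues that ended successfully."""
    return sum(outcomes) / len(outcomes)


scenario = Scenario("arrange a delivery", {"address": "12 Elm St", "deadline": "Friday"})
slots = ["address", "deadline"]
outcomes = [
    run_dialogue(AskingSeeker(slots), ToyProvider(scenario), scenario),
    run_dialogue(GuessingSeeker(slots), ToyProvider(scenario), scenario),
]
print(success_rate(outcomes))
```

In this toy setting the seeker that asks for the missing information completes the task, while the one that assumes uncertain values does not, which mirrors the distinction between acceptable and unacceptable seeker responses that the benchmark is designed to measure.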

Paper Structure (43 sections, 3 equations, 21 figures, 5 tables)


Figures (21)

  • Figure 1: An example of a dialogue background and task. In this task, the provider engages in dialogue with three seekers: S1, S2, and S3. Responses from S1 and S2, which both clarify required but previously unspecified information in different ways, are considered acceptable. However, response S3, which assumes uncertain knowledge, is not acceptable.
  • Figure 2: The complete dialogue of the task in Figure 1.
  • Figure 3: The provider response tree extracted from Figure 2.
  • Figure 4: The success rate of the GPT-4 seeker, broken down by the number of human provider responses. $\spadesuit$ stands for Chat mode, while $\heartsuit$ represents Completion mode.
  • Figure 5: Prompt for Chat mode.
  • ...and 16 more figures