Evaluating Task-oriented Dialogue Systems: A Systematic Review of Measures, Constructs and their Operationalisations

Anouck Braggaar; Christine Liebrecht; Emiel van Miltenburg; Emiel Krahmer

TL;DR

This systematic review maps the constructs and metrics used to evaluate task-oriented dialogue systems, especially in customer-service contexts. It analyses 122 studies retrieved from four databases (ACL, ACM, IEEE and Web of Science), organising measures into intrinsic (NLU, NLG, performance) and system-in-context (task success, usability, user experience) dimensions, and highlights substantial variability and gaps in operationalisation and reporting. The authors discuss recent advances with large language models, including powering dialogue systems, red-teaming LLMs, and using LLMs to evaluate dialogue outputs, while voicing concerns about validity, reliability, and domain biases. They conclude with a research agenda advocating standardisation, better reporting practices, and triangulated evaluation approaches that combine human and automatic metrics to better reflect real-world customer-service performance and organisational impact. Overall, the work offers a reference for researchers and practitioners to navigate evaluation choices, align measures with concrete goals, and move toward more reproducible and comparable evaluation practices in dialogue-system research.

Abstract

This review gives an extensive overview of evaluation methods for task-oriented dialogue systems, paying special attention to practical applications of dialogue systems, such as customer service. The review (1) provides an overview of the constructs and metrics used in previous work, (2) discusses challenges in the context of dialogue system evaluation and (3) develops a research agenda for the future of dialogue system evaluation. We conducted a systematic review of four databases (ACL, ACM, IEEE and Web of Science), which after screening resulted in 122 studies. Those studies were carefully analysed for the constructs and methods they proposed for evaluation. We found a wide variety in both constructs and methods. The operationalisation in particular is not always clearly reported. Newer developments concerning large language models are discussed in two contexts: powering dialogue systems and supporting the evaluation process. We hope that future work will take a more critical approach to the operationalisation and specification of the constructs used. To work towards this aim, this review ends with recommendations for evaluation and suggestions for outstanding questions.

Paper Structure (73 sections, 5 figures, 6 tables)

Introduction
Dialogue systems
Constructs and measurement
Why this survey?
Dialogue systems in the customer service domain
Previous overviews on evaluation of dialogue systems
Reading guide
Method
Databases and search queries
Paper selection
Data Extraction sheet
Data synthesis and grouping constructs
Additional papers
Results
Intrinsic evaluation
...and 58 more sections

Figures (5)

Figure 1: Simplified depiction of an interaction with a task-oriented dialogue system. The user comes to the system with a particular problem that they would like to solve. Through a series of messages to and responses from the dialogue system, both interlocutors work towards finding a resolution. Internally, messages are traditionally processed through a Natural Language Understanding (NLU) module, after which a dialogue manager updates the internal state of the system, and selects an appropriate response, which is then realised by the Natural Language Generation (NLG) module. (Icons via Freepik.com.)
Figure 2: Different measures (M1… M4) operationalising the same construct, capturing different aspects. We may obtain a fairly good coverage of the construct by combining different metrics, but some aspects may remain elusive.
Figure 3: PRISMA figure showing the selection process.
Figure 4: Bar graph showing the occurrence of papers within each year.
Figure 5: A general model of customers interacting with a chatbot that acts on behalf of an organisation. By interacting with the chatbot, customers form impressions and opinions about both the chatbot and the organisation. Some conversations cannot be handled by the chatbot alone, and should be handed over to a human agent who then responds to the customer. (Icons from Freepik.com.)
