Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs

Clemencia Siro; Mohammad Aliannejadi; Maarten de Rijke

TL;DR

This work investigates how user feedback, operationalized as the user's follow-up utterance, affects turn-level evaluation of task-oriented dialogue systems. It compares crowdworkers and LLM annotators across four aspects (relevance, usefulness, interestingness, and explanation quality) under two conditions, with and without the follow-up utterance, using a ReDial subset. Follow-up feedback shifts ratings differently for the two annotator groups: crowdworkers are more susceptible on usefulness and interestingness, LLMs on interestingness and relevance, and crowdworkers' usefulness ratings become more personalized, aligning with the user's explicit feedback. When user requests are ambiguous or complex, agreement among crowdworkers improves if the follow-up is present. The study highlights the potential of integrating user feedback into automated evaluation and advocates combining human and LLM annotators for more reliable dialogue-system assessment. The annotated data is publicly released to spur further research.

Abstract

In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback. In a conversational setting, such signals are usually unavailable due to the nature of the interactions, and evaluation instead often relies on crowdsourced labels. The role of user feedback in annotators' assessment of conversational turns has been little studied. We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of the turn being evaluated. We explore and compare two methodologies for assessing TDSs: one that includes the user's follow-up utterance and one that does not. We use both crowdworkers and large language models (LLMs) as annotators to assess system responses across four aspects: relevance, usefulness, interestingness, and explanation quality. Our findings indicate a distinct difference between the two setups in the ratings assigned by both annotator groups, indicating that user feedback does influence system evaluation. Crowdworkers are more susceptible to user feedback on usefulness and interestingness, whereas LLMs are more susceptible on interestingness and relevance. User feedback leads to a more personalized assessment of usefulness by workers, aligning closely with the user's explicit feedback. Additionally, in cases of ambiguous or complex user requests, user feedback improves agreement among crowdworkers. These findings emphasize the significance of user feedback in refining system evaluations and suggest the potential for automated feedback integration in future research. We publicly release the annotated data to foster research in this area.

Paper Structure (18 sections, 6 figures, 2 tables)

This paper contains 18 sections, 6 figures, 2 tables.

Introduction
Related Work
User feedback
Bias in crowdsourcing evaluation labels
LLMs as annotators
The Annotation Task
Dialogue qualities
Data
Annotation scale
Preliminary experiments
Experimental conditions
Human annotators
LLM as annotator
Crowdsourced Judgments
Effect of user feedback
...and 3 more sections

Figures (6)

Figure 1: A dialogue showing an example of a complex user request with (right) and without (left) the user feedback. The star ratings show the assessments of external assessors judging the usefulness of the system utterance. Given the follow-up utterance, the assessors lower their usefulness rating, aligning with the user feedback.
Figure 2: A comparison of individual worker score distributions for Setup 1 (left column) and Setup 2 (right column).
Figure 3: KDE plots comparing aggregated crowdworker and LLM scores for both setups. The dotted lines represent the overall mean for each setup.
Figure 4: Mean rating for each aspect across the two setups, for both the crowdworkers and LLM.
Figure 5: Difference in scores assigned to dialogue turns for the four aspects: Group 1, with low variability of worker scores around the mean rating, vs. Group 2, with high variability.
...and 1 more figures
