Rethinking Response Evaluation from Interlocutor's Eye for Open-Domain Dialogue Systems

Yuma Tsuta; Naoki Yoshinaga; Shoetsu Sato; Masashi Toyoda

TL;DR

The paper tackles the mismatch between automatic evaluation of open-domain dialogue and the judgments of the human interlocutor. It first shows that interlocutor-aware evaluation requires personalization to the target user: on the Hazumi dataset, outsider-based signals are insufficient, while a target-aware model reaches a correlation of about 0.496 with interlocutor judgments. To scale this approach, the authors propose a dialogue continuity prediction (DCP) task that trains an interlocutor-aware evaluator on large-scale X (formerly Twitter) data, augmented with user profiles and speaker tokens. Empirical results show that DCP-based evaluators with interlocutor personalization align better with human judgments on human responses, while accurately evaluating system responses remains a future challenge. The work advances automatic evaluation by framing it from the interlocutor's perspective and by introducing practical, data-driven methods for personalization and supervision.

Abstract

Open-domain dialogue systems have started to engage in continuous conversations with humans. Such systems must adapt to the human interlocutor and be evaluated from that interlocutor's perspective. However, it is questionable whether current automatic evaluation methods can approximate the interlocutor's judgments. In this study, we examined which features an automatic response evaluator needs in order to reflect the interlocutor's perspective. The first experiment, on the Hazumi dataset, revealed that interlocutor awareness plays a critical role in making automatic response evaluation correlate with the interlocutor's judgments. The second experiment, using massive conversations on X (formerly Twitter), confirmed that dialogue continuity prediction can train an interlocutor-aware response evaluator without human feedback, while revealing that generated responses are harder to evaluate than human responses.
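To make the idea of training without human feedback concrete, the following is a minimal sketch of how dialogue continuity labels might be derived from a logged conversation thread. It assumes that "continuity" means the target interlocutor posts again right after a response; the names `Turn` and `build_dcp_examples` and the marker tokens `[TGT]`, `[OTH]`, and `[PROFILE]` are hypothetical illustrations, not the authors' implementation or input format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    speaker: str  # user id of the person who wrote this turn
    text: str


def build_dcp_examples(
    thread: List[Turn], target: str, profile: str = ""
) -> List[Tuple[str, int]]:
    """Build (input_text, label) pairs for one target interlocutor.

    Assumption: the label is 1 if the target user speaks again right after
    the response, and 0 otherwise. No human annotation is involved.
    """
    examples = []
    for i in range(1, len(thread)):
        response = thread[i]
        # Keep only responses addressed to the target user, i.e. turns by
        # someone else that directly follow one of the target's turns.
        if thread[i - 1].speaker != target or response.speaker == target:
            continue
        continued = i + 1 < len(thread) and thread[i + 1].speaker == target
        # Hypothetical speaker tokens mark who wrote each turn.
        context = " ".join(
            f"{'[TGT]' if t.speaker == target else '[OTH]'} {t.text}"
            for t in thread[: i + 1]
        )
        # Hypothetical profile prefix for personalization.
        text = f"[PROFILE] {profile} {context}" if profile else context
        examples.append((text, int(continued)))
    return examples


thread = [
    Turn("alice", "hi"),
    Turn("bob", "hello"),
    Turn("alice", "how are you"),
    Turn("bob", "fine"),
]
examples = build_dcp_examples(thread, target="alice", profile="likes cats")
```

In this toy thread, the first response by `bob` is followed by another turn from `alice` and is labeled 1, while the last response ends the thread and is labeled 0. A classifier trained on such pairs could then serve as a response scorer for that interlocutor, under the assumptions stated above.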

Paper Structure (21 sections, 3 figures, 5 tables)

Introduction
Related work
Automatic evaluation of dialogue systems
User-oriented NLP tasks
What is important to predict interlocutor evaluations?
Hazumi dialogue datasets
Analyze the effective cues in interlocutor score prediction
Models
Settings
Results
Towards Automatic Response Evaluation from Interlocutor's Eye
Interlocutor Evaluation via Personalized Dialogue Continuity Prediction (DCP)
How to consider the interlocutor in a model?
Experimental Setup
X (formerly Twitter) dialogue dataset
...and 6 more sections

Figures (3)

Figure 1: A discrepancy between interlocutor and outsider evaluations for open-domain dialogue systems.
Figure 2: Automatic response evaluation via dialogue continuity prediction from the interlocutor's perspective.
Figure 3: Result of dialogue continuity prediction task per user group split according to training sample size.
