Online vs Offline: A Comparative Study of First-Party and Third-Party Evaluations of Social Chatbots

Ekaterina Svikhnushina; Pearl Pu

TL;DR

Offline human evaluations fail to capture the subtleties of human-chatbot interactions as effectively as online assessments, whereas automated third-party evaluations using a GPT-4 model, given detailed instructions, offer a better approximation of first-party human judgments.

Abstract

This paper explores the efficacy of online versus offline evaluation methods in assessing conversational chatbots, specifically comparing first-party direct interactions with third-party observational assessments. By extending a benchmarking dataset of user dialogs with empathetic chatbots with offline third-party evaluations, we present a systematic comparison between the feedback from online interactions and the more detached offline third-party evaluations. Our results reveal that offline human evaluations fail to capture the subtleties of human-chatbot interactions as effectively as online assessments. In comparison, automated third-party evaluations using a GPT-4 model offer a better approximation of first-party human judgments given detailed instructions. This study highlights the limitations of third-party evaluations in grasping the complexities of user experiences and advocates for the integration of direct interaction feedback in conversational AI evaluation to enhance system development and user satisfaction.

Paper Structure (11 sections, 3 figures, 2 tables)

Introduction
Related Work
Materials and Methods
Dataset
Comparison of Results
Results
Annotation Experiment
Agreement Analysis for Offline Annotations
Comparison of Benchmarking Results in Offline and Online Settings
Comparison of Human and Automatic Offline Evaluation
Discussion and Conclusion

Figures (3)

Figure 1: Benchmarking results of the four chatbots. Light-grey traces show the results from the online evaluation setup while colored lines represent the offline setup.
Figure 2: Results of ordinal regression on rank. 95% confidence intervals are approximated as two standard errors. Light-grey traces show the results from the online evaluation setup while colored lines represent the offline setup.
Figure 3: Scatter plots of system-level correlation.
