Automatic Evaluation and Moderation of Open-domain Dialogue Systems
Chen Zhang, João Sedoc, Luis Fernando D'Haro, Rafael Banchs, Alexander Rudnicky
TL;DR
The paper presents Track 5 of the DSTC10 challenge, addressing two core problems in open-domain dialogue systems: automatic evaluation metrics that align with human judgments and safe chatbot development to handle toxic user inputs. It surveys a large, multi-source dataset suite (development and hidden test sets) and reports results from nine teams applying diverse, ensemble-based evaluation metrics, highlighting strong performers and generalization gaps. It also details a toxicity-moderation pipeline based on four datasets, annotation studies, and baseline generation of safe responses, noting annotation challenges with low inter-annotator agreement and mixed objective-semantic metric outcomes. Overall, the work advances automatic evaluation methodology for dialogue systems and establishes a foundation for safer, more responsible open-domain chatbots, while outlining concrete avenues for dataset unification, dialogue-level annotations, and a two-part safety task in future work.
Abstract
The development of Open-Domain Dialogue Systems (ODS) is a trending topic due to the large number of research challenges, the large societal and business impact, and advances in the underlying technology. However, the development of these systems requires two important capabilities: 1) automatic evaluation mechanisms that show high correlations with human judgements across multiple dialogue evaluation aspects (with explainable features that provide constructive and explicit feedback on the quality of generative models' responses, enabling quick development and deployment), and 2) mechanisms that help control chatbot responses, avoiding toxicity and handling toxic user comments intelligently while maintaining interaction flow and engagement. This track at the 10th Dialogue System Technology Challenge (DSTC10) is part of the ongoing effort to promote scalable and toxic-free ODS. This paper describes the datasets and baselines provided to participants, as well as submission evaluation results for each of the two proposed subtasks.
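To make the first requirement concrete, the sketch below shows one way to quantify how well an automatic metric tracks human judgements. It assumes Spearman rank correlation as the coefficient, which is a common choice but is not specified in the abstract. The function names and the toy scores are hypothetical and are not taken from the track's data or tooling.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (std_x * std_y)


def spearman(metric_scores, human_ratings):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_ratings):
        raise ValueError("both inputs must have the same length")
    return pearson(average_ranks(metric_scores), average_ranks(human_ratings))


# Hypothetical toy data: one metric score and one human rating per response.
metric_scores = [0.10, 0.40, 0.35, 0.80]
human_ratings = [1, 3, 2, 5]
correlation = spearman(metric_scores, human_ratings)
```

A coefficient near 1 means the metric orders responses the same way the annotators do. In practice this would be computed separately for each dataset and each evaluation aspect.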
