How to Choose How to Choose Your Chatbot: A Massively Multi-System MultiReference Data Set for Dialog Metric Evaluation

Huda Khayrallah, Zuhaib Akhtar, Edward Cohen, Jyothir S, João Sedoc

TL;DR

This work creates and releases an 8-reference dialog dataset by extending single-reference evaluation sets, introduces a new language-learning conversation dataset, and releases model hyperparameters, inference outputs, and metric scores for each system on a variety of datasets.

Abstract

We release MMSMR, a Massively Multi-System MultiReference dataset to enable future work on metrics and evaluation for dialog. Automatic metrics for dialogue evaluation should be robust proxies for human judgments; however, the verification of robustness is currently far from satisfactory. To quantify the robustness of this correlation and understand what is necessary in a test set, we create and release an 8-reference dialog dataset by extending single-reference evaluation sets, and we introduce a new language-learning conversation dataset. We then train 1750 systems and evaluate them on our novel test set and the DailyDialog dataset. We release the novel test set, along with model hyperparameters, inference outputs, and metric scores for each system on a variety of datasets.

Paper Structure (14 sections, 12 figures)

Figures (12)

  • Figure 1: Example from the ESL2 dataset including the 2-turn conversation snippet/prompt, actual continuation, references R1-8 and system outputs S1 - S18,630.
  • Figure 2: Spearman correlations between various metrics on the ESL3 test set. The bottom left includes all systems; the top right includes only the top systems.
  • Figure 3: The percent of data retained when thresholding on a percentile for any of the metrics. The dotted grey line shows the percentage that would be retained if all metrics were in perfect agreement.
  • Figure 4: Training command.
  • Figure 5: Correlations between various metrics on the ESL2 test set. The bottom left includes all systems; the top right includes only the top systems.
  • ...and 7 more figures
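
The quantities named in the captions of Figures 2 and 3 can be illustrated with a small sketch. This is not the authors' code: the metric names and scores below are invented toy values, and `retained_fraction` encodes only one plausible reading of "thresholding on a percentile for any of the metrics", namely that a system is kept if it ranks above the cutoff on at least one metric. Under that reading, metrics in perfect agreement would retain exactly the top (100 - p) percent of systems, and any disagreement retains more.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def retained_fraction(scores, percentile):
    """Fraction of systems ranked above the percentile cutoff on at least one metric.

    Assumption: "any of the metrics" is read as a union over metrics.
    """
    n = len(next(iter(scores.values())))
    cutoff = percentile / 100 * n  # rank threshold shared by all metrics
    keep = set()
    for vals in scores.values():
        ranks = average_ranks(vals)
        keep.update(i for i, r in enumerate(ranks) if r > cutoff)
    return len(keep) / n


# Hypothetical per-system scores for two made-up metrics (one entry per system).
toy_scores = {
    "metric_a": [0.10, 0.40, 0.35, 0.80],
    "metric_b": [10.0, 30.0, 40.0, 90.0],
}

rho = spearman(toy_scores["metric_a"], toy_scores["metric_b"])
kept = retained_fraction(toy_scores, 50)
print(f"Spearman rho = {rho:.2f}, retained at 50th percentile = {kept:.2f}")
```

In this toy case the two metrics disagree on the ordering of the two middle systems, so thresholding at the 50th percentile keeps three of the four systems rather than the two that perfect agreement would give.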