PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison

ChaeHun Park, Minseok Choi, Dohyun Lee, Jaegul Choo

TL;DR

PairEval introduces a reference-free, pairwise dialogue evaluation framework that judges a generated reply by comparing it against a small set of comparison responses using a moderate-size open LM. The method specializes the LM for pairwise comparison by training on synthetic positive and negative responses, including adversarial negatives, and mitigates prompt-order and position biases by averaging over both presentation orders and across comparison examples. Across multiple meta-evaluation benchmarks, PairEval aligns closely with human judgments, often outperforming reference-based metrics and rivaling proprietary LLM-based evaluators, while remaining robust to common dialogue failures and adversarial manipulations. The study highlights practical benefits for open-domain dialogue assessment and outlines ways to improve efficiency, such as reducing the number of required comparisons and optimizing how comparison examples are selected.

Abstract

Building a reliable and automated evaluation metric is a necessary but challenging problem for open-domain dialogue systems. Recent studies proposed evaluation metrics that assess generated responses by considering their relevance to previous dialogue histories. Although effective, these metrics evaluate individual responses directly rather than considering their relative quality compared to other responses. To handle this, we propose PairEval, a novel dialogue evaluation metric for assessing responses by comparing their quality against responses in different conversations. PairEval is built on top of open-sourced and moderate-size language models, and we make them specialized in pairwise comparison between dialogue responses. Extensive experiments on multiple benchmarks demonstrate that our metric exhibits a higher correlation with human judgments than baseline metrics. We also find that the proposed comparative metric is more robust in detecting common failures from open-domain dialogue systems, including repetition and speaker insensitivity.
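
The comparative scoring described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Judge` callable is a hypothetical stand-in for the fine-tuned LM, assumed here to return the probability that the first-shown response is better, and `toy_judge` is an invented placeholder with a deliberate first-position bias. The sketch only shows how averaging over both presentation orders and over several comparison examples could be arranged.

```python
from typing import Callable, Sequence, Tuple

Dialogue = Tuple[str, str]  # (dialogue history, response)

# Hypothetical judge interface (an assumption, not from the paper):
# returns P(response in the first dialogue is better), in [0, 1].
Judge = Callable[[Dialogue, Dialogue], float]


def pairwise_score(target: Dialogue,
                   comparisons: Sequence[Dialogue],
                   judge: Judge) -> float:
    """Average preference for `target` over both orders and all comparisons."""
    if not comparisons:
        raise ValueError("at least one comparison example is required")
    total = 0.0
    for comp in comparisons:
        p_shown_first = judge(target, comp)          # target in first slot
        p_shown_second = 1.0 - judge(comp, target)   # target in second slot
        total += 0.5 * (p_shown_first + p_shown_second)
    return total / len(comparisons)


def toy_judge(first: Dialogue, second: Dialogue) -> float:
    """Toy placeholder: prefers longer responses, plus a +0.2 first-slot bias."""
    diff = len(first[1].split()) - len(second[1].split())
    return min(1.0, max(0.0, 0.5 + 0.1 * diff + 0.2))


if __name__ == "__main__":
    target = ("How was your trip?", "It was great")
    comparisons = [("Any pets?", "Yes"),
                   ("Hi", "Hello there my friend")]
    print(pairwise_score(target, comparisons, toy_judge))
```

In this toy setup the first-slot bias cancels when the two orders are averaged, as long as the judge's output is not clipped at 0 or 1. A real judge would not be biased so symmetrically, so order averaging should be read as a mitigation rather than a guarantee.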

Paper Structure (31 sections, 6 figures, 8 tables)

Figures (6)

  • Figure 1: The overall illustration of PairEval.
  • Figure 2: The different types of responses used to fine-tune an LM in PairEval.
  • Figure 3: Scatter plots between human judgments and metric scores on the DailyDialog-Grade dataset. Each point is a response, and its x and y values denote the human and metric scores, respectively. Noise sampled from $\mathcal{N}(0, 0.03^2)$ is added to the human scores for better visualization. The red line indicates a linear regression fit.
  • Figure 4: Scatter plots between human judgments and metric scores on the DailyDialog-Zhao dataset. The indicators are the same as in Figure 3.
  • Figure 5: Scatter plots between human judgments and metric scores on the TopicalChat-USR dataset. The indicators are the same as in Figure 3.
  • ...and 1 more figure
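
The plotting recipe in the Figure 3 caption (Gaussian jitter on the human scores plus a linear regression line) can be sketched as follows. This is an illustrative sketch only: the function names `jitter` and `fit_line` are hypothetical, the data in the usage lines are made up, and fitting the line on the un-jittered scores is an assumption, since the caption does not say which values the regression uses.

```python
import random
from typing import List, Sequence, Tuple


def jitter(scores: Sequence[float], sigma: float = 0.03,
           seed: int = 0) -> List[float]:
    """Add N(0, sigma^2) noise so points with identical scores do not overlap."""
    rng = random.Random(seed)
    return [s + rng.gauss(0.0, sigma) for s in scores]


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    human = [1.0, 2.0, 3.0, 4.0]    # made-up human ratings
    metric = [0.1, 0.3, 0.5, 0.7]   # made-up metric scores
    x_for_plot = jitter(human)      # jittered x positions, for display only
    slope, intercept = fit_line(human, metric)  # assumed: fit on raw scores
    print(x_for_plot, slope, intercept)
```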