Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach

Haoming Jiang, Bo Dai, Mengjiao Yang, Tuo Zhao, Wei Wei

TL;DR

The paper tackles the problem of automatic, scalable evaluation of dialog systems in interactive settings, where traditional static metrics poorly correlate with human judgments. It introduces ENIGMA, a model-free off-policy evaluation framework that uses pseudo-state padding to handle varying dialog horizons, a DICE-based density-ratio estimator to remain agnostic to behavior policies, and RoBERTa-based representations to manage the combinatorial state-action space. Through experiments on goal-oriented AirDialog and open-domain ConvAI2 data, ENIGMA achieves substantially higher correlation with human evaluations than BLEU/PP/LSTDQ-based approaches and self-play, demonstrating robustness to data coverage issues and sparse rewards. The work highlights the value of public human-model interaction data for advancing automatic dialog evaluation and outlines directions for extending ENIGMA to off-policy improvement.
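The density-ratio idea can be illustrated with a minimal sketch, assuming the correction ratios between the target policy's visitation distribution and the logged data have already been estimated. DICE-style estimators then approximate the policy value by a ratio-weighted average of logged rewards. The function name `ratio_weighted_value`, the self-normalization step, and the toy numbers are illustrative assumptions, not the paper's implementation.

```python
from typing import Sequence


def ratio_weighted_value(ratios: Sequence[float], rewards: Sequence[float]) -> float:
    """Self-normalized, ratio-weighted average of logged rewards.

    `ratios` are assumed (hypothetical input) to be non-negative estimates of
    the density ratio for each logged state-action pair; `rewards` are the
    rewards observed for those same pairs.
    """
    if len(ratios) != len(rewards):
        raise ValueError("ratios and rewards must have the same length")
    total = sum(ratios)
    if total <= 0:
        raise ValueError("ratios must sum to a positive value")
    # Weight each logged reward by its ratio, then normalize by the ratio sum.
    return sum(w * r for w, r in zip(ratios, rewards)) / total


# Toy data (made up): samples the target policy visits more often get larger ratios.
estimate = ratio_weighted_value([3.0, 1.0], [1.0, 0.0])
```

Because the estimate only needs ratios and logged rewards, it does not require knowing which policies produced the logged conversations, which is the sense in which such an estimator is agnostic to the behavior policy.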

Abstract

Reliable automatic evaluation of dialogue systems under an interactive environment has long been overdue. An ideal environment for evaluating dialog systems, also known as the Turing test, needs to involve human interaction, which is usually not affordable for large-scale experiments. Though researchers have attempted to use metrics (e.g., perplexity, BLEU) in language generation tasks or some model-based reinforcement learning methods (e.g., self-play evaluation) for automatic evaluation, these methods only show a very weak correlation with the actual human evaluation in practice. To bridge such a gap, we propose a new framework named ENIGMA for estimating human evaluation scores based on recent advances of off-policy evaluation in reinforcement learning. ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation, making automatic evaluations feasible. More importantly, ENIGMA is model-free and agnostic to the behavior policies for collecting the experience data (see details in Section 2), which significantly alleviates the technical difficulties of modeling complex dialogue environments and human behaviors. Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
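As a rough illustration of what "correlation with human evaluation scores" measures, the sketch below computes a Pearson correlation between per-agent scores from an automatic estimator and the corresponding human scores. The choice of Pearson correlation, the function name `pearson`, and the toy numbers are assumptions made for illustration; the abstract does not specify the coefficient.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for four dialog agents (made-up numbers).
human_scores = [0.20, 0.45, 0.60, 0.80]
estimated_scores = [0.25, 0.40, 0.65, 0.75]
r = pearson(human_scores, estimated_scores)
```

A value of `r` close to 1 would indicate that the automatic estimator ranks and scales agents much as human evaluators do.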

Paper Structure

This paper contains 30 sections, 2 theorems, 17 equations, 28 figures, 8 tables, 2 algorithms.

Key Result

Theorem 1

The augmented MDP with infinite horizon satisfies the following properties:

  • It has a unique stationary state-action visitation distribution $d^\pi(s,a)$;
  • For the state-action pair $(s_t,a_t)$ in a conversation $h$ with padded pseudo states, we have … where $\{(s_k,a_k)\}_{k=1}^{t-1}$ are the state-action pairs in the same conversation as $(s_t,a_t)$;
  • The policy value c…

Figures (28)

  • Figure 1: Dialog for booking a flight ticket (AirDialog).
  • Figure 2: Augmented MDP with Infinite Horizon.
  • Figure 3: Function Approximation with RoBERTa.
  • Figure 4: Regression Plots. The x-axis is the average reward obtained by chatting with humans. The y-axis is the reward estimated by SPE / ENIGMA. Different colors denote different types of rewards (flight score, status score, and overall reward). The solid line is obtained by linear regression and the shaded region indicates the $95\%$ confidence interval.
  • Figure 5: Value estimation using different methods for two target agents ($\pi_1$ and $\pi_2$) vs. # of iterations. Dotted lines denote the true rewards.
  • ...and 23 more figures

Theorems & Definitions (5)

  • Theorem 1
  • Remark 1
  • Remark 2
  • Theorem 1
  • Example 1