PARADISE: A Framework for Evaluating Spoken Dialogue Agents
Marilyn A. Walker, Diane J. Litman, Candace A. Kamm, Alicia Abella
TL;DR
PARADISE tackles the problem of evaluating spoken dialogue agents across tasks by separating what needs to be achieved from how it is achieved in dialogue. It introduces a decision-theoretic performance function that combines a task-based success measure, $\kappa$, with multiple dialogue-cost metrics $c_i$: performance $= \alpha\,N(\kappa) - \sum_i w_i\,N(c_i)$, where each factor is normalized with $N(x) = (x - \overline{x})/\sigma_x$ and the weights $\alpha$ and $w_i$ are learned by regressing user satisfaction on the normalized predictors. The framework relies on an Attribute-Value Matrix (AVM) task representation to allow task-general evaluation and supports calculating performance for subdialogues as well as whole dialogues. The results illustrate how different dialogue strategies can be evaluated and compared, and show that the model's predictive power depends on careful generalization and iterative refinement of the regression model.
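The performance function described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the coefficients `alpha` and `weights` are supplied by the caller here, whereas in PARADISE they are the coefficients obtained by regressing user satisfaction on the normalized predictors; the sample data are invented.

```python
from statistics import mean, stdev

def z_norm(values):
    # N(x) = (x - mean) / stdev, computed per metric across dialogues
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

def performance(kappas, cost_metrics, alpha, weights):
    """PARADISE-style performance: alpha * N(kappa) - sum_i w_i * N(c_i).

    kappas: task-success scores, one per dialogue
    cost_metrics: dict mapping cost name -> list of values, one per dialogue
    alpha, weights: coefficients (learned from user satisfaction in the paper)
    Returns one performance score per dialogue.
    """
    nk = z_norm(kappas)
    ncosts = {name: z_norm(vals) for name, vals in cost_metrics.items()}
    return [
        alpha * nk[d] - sum(weights[n] * ncosts[n][d] for n in ncosts)
        for d in range(len(kappas))
    ]

# Hypothetical data: three dialogues with kappa scores and a turn-count cost.
scores = performance(
    kappas=[0.8, 0.5, 0.9],
    cost_metrics={"turns": [10, 20, 5]},
    alpha=0.5,
    weights={"turns": 0.5},
)
```

Because every factor is z-normalized, the same function applies across agents performing different tasks: a dialogue scores well when its task success is above average and its costs are below average relative to the corpus.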
Abstract
This paper presents PARADISE (PARAdigm for DIalogue System Evaluation), a general framework for evaluating spoken dialogue agents. The framework decouples task requirements from an agent's dialogue behaviors, supports comparisons among dialogue strategies, enables the calculation of performance over subdialogues and whole dialogues, specifies the relative contribution of various factors to performance, and makes it possible to compare agents performing different tasks by normalizing for task complexity.
