An Analysis of User Behaviors for Objectively Evaluating Spoken Dialogue Systems
Koji Inoue, Divesh Lala, Keiko Ochi, Tatsuya Kawahara, Gabriel Skantze
TL;DR
The paper tackles the difficulty of objectively evaluating spoken dialogue systems (SDSs) by proposing a framework that infers system quality from observable user behaviors in social dialogue tasks. It analyzes the relationship between behavior features and subjective scores across three tasks (Attentive Listening, Job Interview, and First-meeting Conversation) using an XGBoost regression model with SHAP explanations, and assesses predictive accuracy with leave-one-out cross-validation. Key findings show that utterance-related metrics and disfluencies inform evaluation in tasks where user speech dominates, while turn-taking cues matter more in highly interactive settings. This approach supports reproducible, objective comparison of SDSs by providing task-specific behavioral indicators linked to subjective assessments, with potential for future multimodal extensions.
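As a rough illustration of the leave-one-out protocol mentioned above, the sketch below holds out one session at a time, fits a model on the rest, and scores the held-out prediction. It is not the paper's pipeline: a one-feature least-squares line stands in for the XGBoost regressor so the example needs only the standard library, and the function names, the mean-absolute-error metric, and the toy data are all assumptions made for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares.

    Stand-in for the XGBoost regressor named in the summary (assumption).
    """
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        # All training inputs identical: fall back to predicting the mean.
        return 0.0, my
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def leave_one_out_mae(xs, ys):
    """Mean absolute error over folds that each hold out one session."""
    errors = []
    for i in range(len(xs)):
        # Train on every session except the i-th one.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        prediction = slope * xs[i] + intercept
        errors.append(abs(prediction - ys[i]))
    return mean(errors)


# Hypothetical data: one behavior feature and one subjective score per session.
num_user_words = [120.0, 95.0, 210.0, 160.0, 80.0]
subjective_score = [4.5, 4.0, 6.0, 5.5, 3.5]
loo_error = leave_one_out_mae(num_user_words, subjective_score)
```

Leave-one-out suits small user studies because every session serves as test data exactly once while the model still trains on all the others.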
Abstract
Establishing evaluation schemes for spoken dialogue systems is important but challenging. While subjective evaluations are commonly used in user experiments, objective evaluations are necessary for research comparison and reproducibility. To address this issue, we propose a framework for indirectly but objectively evaluating systems based on users' behaviors. To this end, we investigate the relationship between user behaviors and subjective evaluation scores in three social dialogue tasks: attentive listening, job interview, and first-meeting conversation. The results reveal that in dialogue tasks where user utterances are primary, such as attentive listening and job interview, indicators like the number of utterances and words play a significant role in evaluation. Observing disfluency can also indicate effectiveness in formal tasks, such as job interview. On the other hand, in dialogue tasks with high interactivity, such as first-meeting conversation, behaviors related to turn-taking, like average switch pause length, become more important. These findings suggest that selecting appropriate user behaviors can provide valuable insights for objective evaluation in each social dialogue task.
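The behaviors named in the abstract (number of utterances, number of words, average switch pause length) can be computed from a time-aligned transcript. The following is a minimal sketch under stated assumptions, not the authors' feature extractor: the `Utterance` record, the `user_behavior_features` function, whitespace word counting, and the definition of a switch pause as the gap between the end of a system utterance and the start of the next user utterance are all hypothetical choices made here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Utterance:
    speaker: str  # "user" or "system" (assumed labels)
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # transcribed words


def user_behavior_features(utterances):
    """Compute simple user-behavior indicators for one dialogue session."""
    ordered = sorted(utterances, key=lambda u: u.start)
    user_utts = [u for u in ordered if u.speaker == "user"]

    # Assumed definition: time from the end of a system utterance to the
    # start of the immediately following user utterance. Overlapping speech
    # yields a negative value and is kept as-is in this sketch.
    switch_pauses = [
        cur.start - prev.end
        for prev, cur in zip(ordered, ordered[1:])
        if prev.speaker == "system" and cur.speaker == "user"
    ]

    return {
        "num_utterances": len(user_utts),
        # Whitespace tokenization is a simplification.
        "num_words": sum(len(u.text.split()) for u in user_utts),
        "avg_switch_pause": mean(switch_pauses) if switch_pauses else None,
    }


# Hypothetical attentive-listening excerpt.
dialogue = [
    Utterance("system", 0.0, 2.0, "Tell me about your trip"),
    Utterance("user", 2.5, 5.0, "I went to Kyoto last week"),
    Utterance("system", 5.2, 6.0, "I see"),
    Utterance("user", 7.0, 9.0, "It was really beautiful"),
]
features = user_behavior_features(dialogue)
```

Per-session feature dictionaries of this kind could then be paired with subjective scores, with a different subset of features emphasized for each dialogue task.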
