What Do Humans Hear When Interacting? Experiments on Selective Listening for Evaluating ASR of Spoken Dialogue Systems

Kiyotada Mori; Seiya Kawano; Chaoran Liu; Carlos Toshinori Ishi; Angel Fernando Garcia Contreras; Koichiro Yoshino

TL;DR

The paper tackles the mismatch between traditional $WER$-based ASR evaluation and the needs of spoken dialogue systems by examining how humans selectively listen during dialogue generation. Through a large-scale experiment combining dialogue response production with post-hoc transcription in noisy environments, the authors show that humans prioritize content words and achieve higher semantic alignment with references than ASR, despite higher $WER$ overall. They introduce a POS-weighted metric, $H\text{-}WWER$, derived from regression weights over content vs. function words, demonstrating that human-centered evaluation can better capture the information relevant for dialogue responses. The work suggests that new ASR evaluation methods, informed by human selective listening, can more accurately reflect the utility of recognition for SDSs and lays groundwork for cross-language and neurophysiological extensions.

Abstract

Spoken dialogue systems (SDSs) use automatic speech recognition (ASR) at the front end of their pipeline. The role of ASR in an SDS is to correctly recognize the information in user speech that is relevant to response generation. Examining human selective listening, the ability to focus on the important parts of a conversation while listening to speech, will enable us to identify the ASR capabilities that SDSs require and to evaluate them. In this study, we experimentally confirmed that humans listen selectively when generating dialogue responses by comparing the transcriptions humans produced for response generation against reference transcriptions. Based on our experimental results, we discuss the possibility of a new ASR evaluation method that leverages human selective listening and can identify the gap in transcription ability between ASR systems and humans.
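To make the contrast between plain $WER$ and a POS-weighted variant concrete, the following is a minimal sketch, not the authors' $H\text{-}WWER$ implementation. It assumes a weighted edit distance in which each error costs the weight of the word involved, normalized by the total weight of the reference. The function-word list (`FUNCTION_WORDS`), the two weights, and the helper names are hypothetical stand-ins: a real setup would use a POS tagger and the regression weights estimated in the paper, which are not reproduced here.

```python
from typing import Callable, Sequence

# Hypothetical stop list standing in for a real POS tagger (assumption).
FUNCTION_WORDS = {"a", "an", "the", "to", "of", "in", "on", "at", "is", "are", "and", "or"}

# Illustrative weights only; NOT the regression weights from the paper.
CONTENT_WEIGHT = 1.0
FUNCTION_WEIGHT = 0.2


def pos_weight(word: str) -> float:
    """Return a lower weight for function words than for content words."""
    return FUNCTION_WEIGHT if word.lower() in FUNCTION_WORDS else CONTENT_WEIGHT


def uniform_weight(word: str) -> float:
    """Every word counts equally, which reduces the score to plain WER."""
    return 1.0


def weighted_wer(
    ref: Sequence[str],
    hyp: Sequence[str],
    weight: Callable[[str], float] = uniform_weight,
) -> float:
    """Weighted edit distance between ref and hyp, divided by the reference weight."""
    n, m = len(ref), len(hyp)
    # dist[i][j]: minimum weighted cost of turning ref[:i] into hyp[:j]
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + weight(ref[i - 1])  # deletions only
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + weight(hyp[j - 1])  # insertions only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            r, h = ref[i - 1], hyp[j - 1]
            # A match is free; a substitution costs the heavier of the two words.
            sub_cost = 0.0 if r == h else max(weight(r), weight(h))
            dist[i][j] = min(
                dist[i - 1][j - 1] + sub_cost,  # match or substitution
                dist[i - 1][j] + weight(r),     # deletion of a reference word
                dist[i][j - 1] + weight(h),     # insertion of a hypothesis word
            )
    total = sum(weight(w) for w in ref)
    return dist[n][m] / total if total else 0.0


reference = "please book a table at the station".split()
drops_function_words = "please book table at station".split()
garbles_content_words = "please book a cable at the stadium".split()

plain_a = weighted_wer(reference, drops_function_words)
plain_b = weighted_wer(reference, garbles_content_words)
weighted_a = weighted_wer(reference, drops_function_words, pos_weight)
weighted_b = weighted_wer(reference, garbles_content_words, pos_weight)
```

Both hypotheses contain two errors against a seven-word reference, so their plain $WER$ is identical. Under the illustrative weights, the hypothesis that only drops function words scores far better than the one that garbles content words, which is the kind of distinction a human-centered metric is meant to capture.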
