Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems

Oswald Zink, Yosuke Higuchi, Carlos Mullov, Alexander Waibel, Tetsunori Kobayashi

TL;DR

This paper proposes predicting the forthcoming words and estimating the time remaining until the end of an utterance (EOU) from the middle portion of an utterance. To do so, an encoder-decoder ASR model is trained with future segments of the utterance masked, and the decoder is prompted to predict the words in the masked audio.

Abstract

Effective spoken dialog systems should facilitate natural interactions with quick and rhythmic timing, mirroring human communication patterns. To reduce response times, previous efforts have focused on minimizing the latency in automatic speech recognition (ASR) to optimize system efficiency. However, this approach still requires waiting until the speaker has finished speaking before ASR can complete processing, which limits the time available for natural language processing (NLP) to formulate accurate responses. As humans, we continuously anticipate and prepare responses even while the other party is still speaking. This allows us to respond appropriately without missing the optimal time to speak. In this work, as a pioneering study toward a conversational system that simulates such human anticipatory behavior, we aim to realize a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance (EOU), using the middle portion of an utterance. To achieve this, we propose a training strategy for an encoder-decoder-based ASR system, which involves masking future segments of an utterance and prompting the decoder to predict the words in the masked audio. Additionally, we develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information to accurately detect the EOU. The experimental results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300 ms prior to the actual EOU. Moreover, the proposed training strategy exhibits general improvements in ASR performance.
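
The masking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `mask_future`, the 10 ms frame hop, and the choice to zero out the masked frames (rather than drop or otherwise replace them) are all assumptions made for illustration. The idea shown is only that everything from $T_{\mathsf{mask}}$ milliseconds before the EOU onward is hidden from the model, while the training target would remain the full transcript.

```python
def mask_future(frames, t_eou_ms, t_mask_ms, frame_ms=10.0, fill=0.0):
    """Hide all frames from (t_eou_ms - t_mask_ms) onward.

    frames    : list of per-frame feature vectors (lists of floats)
    t_eou_ms  : end-of-utterance time in milliseconds
    t_mask_ms : how far ahead of the EOU the mask starts, in milliseconds
    frame_ms  : frame hop in milliseconds (hypothetical default)
    fill      : value written into masked frames (assumption: zeroing)
    """
    if t_mask_ms < 0:
        raise ValueError("t_mask_ms must be non-negative")
    # First masked position in time, clamped so it never goes below zero.
    cut_ms = max(0.0, t_eou_ms - t_mask_ms)
    # Convert to a frame index, clamped to the utterance length.
    cut = min(len(frames), int(cut_ms // frame_ms))
    visible = [list(f) for f in frames[:cut]]
    masked = [[fill] * len(f) for f in frames[cut:]]
    return visible + masked


# Usage: a 100 ms utterance whose speech ends at 80 ms, masked 30 ms ahead of the EOU.
frames = [[1.0, 2.0] for _ in range(10)]
masked = mask_future(frames, t_eou_ms=80, t_mask_ms=30)
```

Because the mask starts at `t_eou_ms - t_mask_ms` and runs to the end of the input, it also covers the trailing silence after the EOU, which matches the setup described in the Figure 1 caption.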

Paper Structure

This paper contains 17 sections, 1 equation, 5 figures, 1 table.

Figures (5)

  • Figure 1: Schematic drawing of predictive tasks of interest. Given an utterance of which we mask $T_{\mathsf{mask}}$ milliseconds ahead of EOU and the trailing silence shown as $\phi$, predictive EOU detection tries to predict $t_{\mathsf{EOU}}$ based on the available audio information. The goal of predictive ASR is to generate words corresponding to the masked input (i.e., "you") based on the visible audio information and preceding tokens.
  • Figure 2: Distribution of silence duration $T - t_{\mathsf{EOU}}$ across development and test sets of Switchboard.
  • Figure 3: Proposed approach for detecting EOU based on cross-attention mechanism. It computes attention scores used for generating the final output, the end-of-sentence token ($\texttt{<EOS>}$). To identify the EOU, the upper boundary of the frames related to this final output is determined by comparing scores $a_t \in \mathbf{a}_4$ to the maximum score $a_{\mathsf{max}}$.
  • Figure 4: Absolute difference in EOU timing [ms] and FWER [%] on LS-100 test set, evaluated across different mask durations.
  • Figure 5: Absolute difference in EOU timing [ms] and FWER [%] on SWBD test set, evaluated across different mask durations.
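
The cross-attention-based EOU detection summarized in the Figure 3 caption can be sketched roughly as follows. The caption only says that the scores $a_t$ for the $\texttt{<EOS>}$ token are compared to the maximum score $a_{\mathsf{max}}$. The exact comparison rule is not given here, so this sketch assumes a relative threshold (`ratio`, hypothetical) and reads "upper boundary" as the latest frame that passes it. The function name and the 40 ms encoder frame duration are likewise assumptions, not values from the paper.

```python
def eou_from_attention(eos_attention, ratio=0.5, frame_ms=40.0):
    """Estimate the EOU from the cross-attention scores of the <EOS> token.

    eos_attention : attention scores a_t over encoder frames for <EOS>
    ratio         : hypothetical threshold relative to a_max
    frame_ms      : duration of one encoder frame in ms (assumption)

    Returns (frame_index, time_ms) of the assumed upper boundary.
    """
    if not eos_attention:
        raise ValueError("eos_attention must not be empty")
    a_max = max(eos_attention)
    # Frames considered "related" to <EOS>: score within `ratio` of the maximum.
    related = [t for t, a_t in enumerate(eos_attention) if a_t >= ratio * a_max]
    # Assumed reading of "upper boundary": the latest related frame.
    upper = related[-1]
    # Report the end time of that frame.
    return upper, (upper + 1) * frame_ms


# Usage with made-up attention scores over eight encoder frames.
scores = [0.01, 0.02, 0.05, 0.30, 0.40, 0.15, 0.05, 0.02]
frame, time_ms = eou_from_attention(scores)
```

With these made-up scores, the frames at indices 3 and 4 pass the threshold, so the sketch returns frame 4 and an estimated EOU of 200 ms. The absolute EOU timing difference reported in Figures 4 and 5 would then be the absolute gap between such an estimate and the reference $t_{\mathsf{EOU}}$.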