Dynamic Behaviour of Connectionist Speech Recognition with Strong Latency Constraints

Giampiero Salvi

TL;DR

This work tackles low-latency phonetic speech recognition for real-time lip synchronisation of a talking avatar, combining ANN/HMM models with a tightly constrained decoder look-ahead. It systematically varies the length of time dependencies in the language model (LM), the time dynamics of the neural networks, and the decoder look-ahead to study how they interact, and finds that the benefit of time-dependent neural models depends strongly on LM structure and latency. Using two LM designs (the alpha and wordlen tests) and three MLP topologies (ANN, RNN1, RNN2), the study shows that recurrent models generally outperform static ones. Longer time dependencies and longer look-ahead help when the LM itself spans long time dependencies, but sometimes give diminishing returns because the networks and the HMM carry overlapping information. An entropy-based confidence measure is proposed to quantify frame-level reliability, supporting real-time decision-making for the avatar synthesis task. Overall, the findings offer guidelines for designing low-latency speech-to-articulation systems by balancing neural dynamics, decoder latency, and LM complexity.

Abstract

This paper describes the use of connectionist techniques in phonetic speech recognition with strong latency constraints. The constraints are imposed by the task of deriving the lip movements of a synthetic face in real time from the speech signal, by feeding the phonetic string into an articulatory synthesiser. Particular attention has been paid to analysing the interaction between the time evolution model learnt by the multi-layer perceptrons and the transition model imposed by the Viterbi decoder, in different latency conditions. Two experiments were conducted in which the time dependencies in the language model (LM) were controlled by a parameter. The results show a strong interaction between the three factors involved, namely the neural network topology, the length of time dependencies in the LM and the decoder latency.

Paper Structure (22 sections, 7 equations, 14 figures, 4 tables)

Figures (14)

  • Figure 1: Dependencies in a first-order HMM represented as a Bayesian network graph.
  • Figure 2: Dependencies introduced by time dependent MLPs.
  • Figure 3: Interaction between the $\delta$ and $b$ terms in Viterbi decoding.
  • Figure 4: Trellis plot with three Viterbi paths (varying look-ahead length).
  • Figure 5: Illustration of the "wordlen test" design: the transcription of each test utterance is split into words of increasing lengths, which are used in the recognition network.
  • ...and 9 more figures