Comparison of Conventional Hybrid and CTC/Attention Decoders for Continuous Visual Speech Recognition

David Gimeno-Gómez, Carlos-D. Martínez-Hinarejos

TL;DR

The paper investigates how conventional DNN-HMM decoders compare to state-of-the-art CTC/Attention decoders for continuous visual speech recognition across data-scarcity and language/domain-mismatch scenarios. It uses a shared visual speech encoder based on a Conformer backbone and evaluates both decoding paradigms on three datasets (LRS2-BBC, LRS3-TED, LIP-RTVE) with detailed training and decoding protocols, including lattice rescoring for the DNN-HMM and shallow fusion for CTC/Attention. The key finding is that the conventional DNN-HMM decoder often outperforms the CTC/Attention approach when data are scarce or visual features are non-ideal, with considerably lower training time and fewer parameters; as more data become available, the gap narrows, and end-to-end decoders can approach the performance of the traditional pipeline. The work highlights the robustness and practicality of DNN-HMM for data-scarce VSR and points to future directions in multilingual visual speech representations and domain transfer benchmarks.

Abstract

Thanks to the rise of deep learning and the availability of large-scale audio-visual databases, recent advances have been achieved in Visual Speech Recognition (VSR). Similar to other speech processing tasks, these end-to-end VSR systems are usually based on encoder-decoder architectures. While encoders are somewhat general, multiple decoding approaches have been explored, such as the conventional hybrid model based on Deep Neural Networks combined with Hidden Markov Models (DNN-HMM) or the Connectionist Temporal Classification (CTC) paradigm. However, there are languages and tasks in which data is scarce, and for this situation there is no clear comparison between different types of decoders. Therefore, we focused our study on how the conventional DNN-HMM decoder and its state-of-the-art CTC/Attention counterpart behave depending on the amount of data used for their estimation. We also analyzed to what extent our visual speech features were able to adapt to scenarios for which they were not explicitly trained, considering either a similar dataset or one collected for a different language. Results showed that the conventional paradigm reached recognition rates that improve on those of the CTC/Attention model in data-scarcity scenarios, along with reduced training time and fewer parameters.

Paper Structure

This paper contains 16 sections, 2 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: The overall architecture of our visual speech encoder. For simplicity, the initial layer normalization, the residual connection, and the final dropout of each module composing the Conformer encoder are omitted. Conv and FFN refer to Convolutional layer and Feed Forward Network, respectively.
  • Figure 2: Comparison of the performance (% WER) of the DNN-HMM and CTC/Attention decoders as a function of the number of hours used to estimate each paradigm. The 9, 223, and 437 hours refer to the entire training sets of the LIP-RTVE, LRS2-BBC, and LRS3-TED databases, respectively.
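
Figure 2 reports performance as word error rate (% WER). As a reminder of how that metric is usually defined, the following is a minimal sketch that counts word-level substitutions, deletions, and insertions with a Levenshtein alignment and divides by the reference length. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the scoring tool and text normalization actually used in the paper are not stated here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word error rate, in percent, of a hypothesis against a reference.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(round(word_error_rate("the cat sat on the mat", "the cat sat on mat"), 2))
```

Because insertions are counted, the value can exceed 100% when the hypothesis is much longer than the reference.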