Evaluation of End-to-End Continuous Spanish Lipreading in Different Data Conditions

David Gimeno-Gómez, Carlos-D. Martínez-Hinarejos

TL;DR

The paper tackles end-to-end Spanish visual speech recognition (VSR) under diverse data conditions using a hybrid CTC/Attention model with a Conformer encoder and a language model (LM). It consolidates a Spanish lipreading benchmark that combines VLRF and LIP-RTVE with additional corpora to cover heterogeneous domains, and it reports a thorough ablation study and error analysis. The proposed architecture achieves state-of-the-art results on two Spanish datasets. The analysis highlights the critical roles of the CTC branch and the LM, and examines how the Zipfian word-frequency distribution of the corpora bears on VSR performance. The work advances Spanish VSR and provides resources for cross-domain evaluation and future audio-visual extensions.

Abstract

Visual speech recognition remains an open research problem: without the auditory signal, a system must cope with visual ambiguities, inter-speaker variability, and the complex modeling of silence. Nonetheless, remarkable results have recently been achieved in the field thanks to the availability of large-scale databases and the use of powerful attention mechanisms. In addition, languages other than English are now a focus of interest. This paper presents notable advances in automatic continuous lipreading for Spanish. First, an end-to-end system based on the hybrid CTC/Attention architecture is presented. Experiments are conducted on two corpora of disparate nature, reaching state-of-the-art results that significantly improve on the best performance obtained to date for both databases. In addition, a thorough ablation study examines how the different components of the architecture influence the quality of speech recognition. Then, a rigorous error analysis investigates the different factors that could affect the learning of the automatic system. Finally, a new Spanish lipreading benchmark is consolidated. Code and trained models are available at https://github.com/david-gimeno/evaluating-end2end-spanish-lipreading.

Paper Structure

This paper contains 13 sections, 2 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Data preprocessing for VSR tasks involves identifying the speaker's face, detecting 68 facial landmarks, applying an affine transformation w.r.t. a neutral reference frame to remove translation and scaling variations, and finally extracting the region of interest centered on the speaker's mouth.
  • Figure 2: Architecture of the end-to-end VSR model based on the auto-regressive CTC/Attention paradigm, including design details for both its training and inference processes. For simplicity, the initial layer normalization, the residual connection, and the final dropout of each module within the layers of the Conformer- and Transformer-based components are omitted. FFN, CE, and CTC refer to Feed-Forward Network, Cross Entropy, and Connectionist Temporal Classification, respectively.
  • Figure 3: Excerpts from original videos in the LIP-RTVE database, showcasing the diverse speakers and scenarios to be addressed in this challenging task.
  • Figure 4: Histogram of system performance (%WER) on the proposed Spanish databases.
  • Figure 5: Relationship between Zipf's law and the VLRF and LIP-RTVE databases. Similar behaviour was observed for the Spanish CMU-MOSEAS and MuAViC corpora, but they were omitted for clarity. SD and SI refer to the speaker-dependent and speaker-independent partitions of the corresponding database, respectively. Note that both axes are on a logarithmic scale.
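
The rank-frequency relationship behind Figure 5 can be illustrated with a short sketch. This is not the authors' analysis code: the function names `rank_frequency` and `zipf_slope` and the toy transcripts are assumptions introduced here for illustration only. Under Zipf's law, word frequency is roughly proportional to 1/rank, so a plot with both axes on a logarithmic scale is close to a straight line with slope near -1.

```python
import math
from collections import Counter


def rank_frequency(transcripts):
    """Return word frequencies sorted from most to least frequent."""
    counts = Counter(
        word for line in transcripts for word in line.lower().split()
    )
    return [freq for _, freq in counts.most_common()]


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    if len(frequencies) < 2:
        raise ValueError("need at least two ranked frequencies")
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Toy transcripts (invented, not taken from VLRF or LIP-RTVE).
toy_transcripts = ["el gato come", "el perro come", "el gato duerme", "el perro"]
toy_frequencies = rank_frequency(toy_transcripts)
toy_slope = zipf_slope(toy_frequencies)

# Frequencies that follow 1/rank exactly give a slope of -1.
ideal_slope = zipf_slope([120, 60, 40, 30, 24])
```

On a real corpus, the same two functions could be applied to the training transcriptions of each partition. A fitted slope far from -1, or a long tail of words seen only once, would indicate how much of the vocabulary the model rarely observes during training.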