Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs

Alexandros Haliassos, Rodrigo Mira, Honglie Chen, Zoe Landgraf, Stavros Petridis, Maja Pantic

TL;DR

Training a single model for ASR, VSR, and AVSR improves VSR and AVSR performance and overcomes the optimisation difficulties typical of training from scratch. A greedy pseudo-labelling approach makes more effective use of unlabelled samples, addressing shortcomings of related self-supervised methods.

Abstract

Research in auditory, visual, and audiovisual speech recognition (ASR, VSR, and AVSR, respectively) has traditionally been conducted independently. Even recent self-supervised studies addressing two or all three tasks simultaneously tend to yield separate models, leading to disjoint inference pipelines with increased memory requirements and redundancies. This paper proposes unified training strategies for these systems. We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance, overcoming typical optimisation challenges when training from scratch. Moreover, we introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples, addressing shortcomings in related self-supervised methods. Finally, we develop a self-supervised pre-training method within our framework, proving its effectiveness alongside our semi-supervised approach. Despite using a single model for all tasks, our unified approach achieves state-of-the-art performance compared to recent methods on LRS3 and LRS2 for ASR, VSR, and AVSR, as well as on the newly released WildVSR dataset. Code and models are available at https://github.com/ahaliassos/usr.

Paper Structure

This paper contains 74 sections, 10 equations, 2 figures, 18 tables.

Figures (2)

  • Figure 1: Unified Speech Recognition. Our USR method combines self-supervised pre-training with semi-supervised fine-tuning. For semi-supervised training, pseudo-labels are generated from unmasked audiovisual features using an exponential moving average (EMA)-based teacher. The student, which receives masked inputs, predicts pseudo-labels for unlabelled data and ground-truth labels for labelled data. To obtain the pseudo-labels, an argmax operation is applied to the CTC and attention output probabilities of the teacher; tokens whose predicted probability falls below a fixed threshold are discarded. For self-supervised pre-training, a student encoder processes masked visual, auditory, and audiovisual samples and, via a shallow predictor, predicts targets generated by an EMA-based teacher that receives unmasked audiovisual samples. The targets are the average outputs of the teacher blocks. The resulting student weights are used to initialise both the student and the teacher in semi-supervised fine-tuning. Features are extracted by modality-specific feature extractors and concatenated along the channel dimension to produce the audiovisual inputs. The auditory, visual, and audiovisual student inputs are batched together for training efficiency.
  • Figure 2: Pseudo-label filtering threshold. Left: validation plots for different values of the threshold $\tau$. Right: final WER for different values of $\tau$.
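
The two mechanisms named in the captions above, the EMA-based teacher and the threshold $\tau$ used to filter pseudo-labels, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `IGNORE` marker, the momentum value, and the use of plain Python lists in place of tensors are all assumptions made for illustration. How discarded tokens are handled downstream (for example, excluded from the loss) is also assumed here.

```python
from typing import List, Sequence

IGNORE = -1  # hypothetical marker for tokens discarded by the threshold


def ema_update(teacher: List[float], student: Sequence[float],
               momentum: float = 0.999) -> None:
    """Update teacher weights in place: teacher <- m * teacher + (1 - m) * student."""
    for i, s in enumerate(student):
        teacher[i] = momentum * teacher[i] + (1.0 - momentum) * s


def filter_pseudo_labels(probs: Sequence[Sequence[float]], tau: float) -> List[int]:
    """Take the argmax of each per-token distribution and mark tokens
    whose top probability is below tau as IGNORE."""
    labels = []
    for dist in probs:
        best = max(range(len(dist)), key=dist.__getitem__)  # argmax index
        labels.append(best if dist[best] >= tau else IGNORE)
    return labels


# Toy teacher output: three tokens over a two-symbol vocabulary.
teacher_probs = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]
pseudo_labels = filter_pseudo_labels(teacher_probs, tau=0.7)
print(pseudo_labels)  # [0, -1, 1]
```

In this toy example the second token is dropped because its most likely symbol has probability 0.6, which is below $\tau = 0.7$. A higher $\tau$ keeps fewer but more reliable pseudo-labels, which is the trade-off Figure 2 examines.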