Enhancing CTC-Based Visual Speech Recognition

Hendrik Laux; Anke Schmeink

TL;DR

LiteVSR2 maintains the efficiency of its predecessor while significantly enhancing accuracy, thereby demonstrating the potential for resource-efficient advancements in VSR technology.

Abstract

This paper presents LiteVSR2, an enhanced version of our previously introduced efficient approach to Visual Speech Recognition (VSR). Building upon our knowledge distillation framework from a pre-trained Automatic Speech Recognition (ASR) model, we introduce two key improvements: a stabilized video preprocessing technique and feature normalization in the distillation process. These improvements yield substantial performance gains on the LRS2 and LRS3 benchmarks, positioning LiteVSR2 as the current best CTC-based VSR model without increasing the volume of training data or computational resources utilized. Furthermore, we explore the scalability of our approach by examining performance metrics across varying model complexities and training data volumes. LiteVSR2 maintains the efficiency of its predecessor while significantly enhancing accuracy, thereby demonstrating the potential for resource-efficient advancements in VSR technology.

Paper Structure (13 sections, 4 equations, 4 figures, 1 table)

Introduction
Related Work
Contribution
Methodology
Model Architecture
Feature Normalization
Input Video Processing
Data
Training Details
Experimental Evaluation
Benchmark Results
Alignment of the Pre-Training Objective
Conclusion

Figures (4)

Figure 1: Updated LiteVSR Architecture with Feature Normalization.
Figure 2: Audio feature statistics for the LRS2-pretrain dataset. The figure shows the mean (black lines) and standard deviation (blue bars) of features produced by the 8th Conformer layer of the stt_en_conformer_ctc_small model. These outputs from $B_a$ are used as targets for training $B_v$. For readability, only a subset of the 176 features is shown.
Figure 3: Distribution of selected features from Figure 2.
Figure 4: Scatter plot showing the relation between the encoding loss $\mathcal{L}_{\text{enc}}$ and the WER metric (upper plot) / the CTC loss (lower plot) after pre-training. We feed the silent frames of each video to the pre-trained visual base to obtain the visual feature representation $B_v(\mathbf{x}_v)$. We use these features to obtain the encoding loss between audio and visual features and let $H_a$ transcribe the visual features to calculate the CTC loss and WER for each sample of the (unseen) LRS2 and LRS3 test sets. The black line indicates the trend-line obtained from using linear regression on the data.
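
The trend-line mentioned in the Figure 4 caption can be illustrated with an ordinary least-squares fit. The following is only a minimal sketch, not the authors' code: the helper name `fit_trend_line` is hypothetical, NumPy is an assumed choice of library, and the sample values are made-up placeholders rather than results from the paper.

```python
import numpy as np


def fit_trend_line(enc_loss, metric):
    """Fit metric ~= slope * enc_loss + intercept by least squares.

    Hypothetical helper for illustration; the paper only states that the
    trend-line is obtained by linear regression.
    """
    enc_loss = np.asarray(enc_loss, dtype=float)
    metric = np.asarray(metric, dtype=float)
    # np.polyfit returns coefficients from highest degree to lowest,
    # so for deg=1 the result is (slope, intercept).
    slope, intercept = np.polyfit(enc_loss, metric, deg=1)
    return float(slope), float(intercept)


# Placeholder per-sample values, NOT measurements from LRS2 or LRS3.
example_enc_loss = [0.10, 0.20, 0.30, 0.40]
example_wer = [0.15, 0.25, 0.35, 0.45]

slope, intercept = fit_trend_line(example_enc_loss, example_wer)
```

In such a fit, a positive slope would mean that samples with a higher encoding loss tend to have a higher WER (or CTC loss), which is the kind of relation the scatter plots are meant to show.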
