LipNet: End-to-End Sentence-level Lipreading

Yannis M. Assael, Brendan Shillingford, Shimon Whiteson, Nando de Freitas

TL;DR

LipNet introduces the first end-to-end, sentence-level lipreading model that maps sequences of mouth-region video frames to text using spatiotemporal convolutions, bidirectional GRUs, and the CTC loss. Trained entirely end-to-end, LipNet achieves 95.2% sentence-level accuracy on the GRID overlapped-speaker split, surpassing prior word-level methods and human lipreaders, and generalizes to unseen speakers with strong performance. The work includes extensive analysis of learned representations via saliency maps and viseme perturbations, showing the model attends to phonologically relevant mouth movements and that most errors occur within viseme groups. The paper demonstrates the value of end-to-end spatiotemporal feature learning for visual speech and suggests paths toward larger datasets and audio-visual extensions for robust speech recognition.

Abstract

Lipreading is the task of decoding text from the movement of a speaker's mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). However, existing work on models trained end-to-end performs only word classification, rather than sentence-level sequence prediction. Studies have shown that human lipreading performance increases for longer words (Easton & Basala, 1982), indicating the importance of features capturing temporal context in an ambiguous communication channel. Motivated by this observation, we present LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end. To the best of our knowledge, LipNet is the first end-to-end sentence-level lipreading model that simultaneously learns spatiotemporal visual features and a sequence model. On the GRID corpus, LipNet achieves 95.2% accuracy on the sentence-level, overlapped-speaker split task, outperforming experienced human lipreaders and the previous 86.4% word-level state-of-the-art accuracy (Gergen et al., 2016).
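To make the frame-to-text mapping more concrete, the following is a minimal sketch of greedy decoding under the usual CTC convention: take the most probable symbol at each time-step, merge consecutive repeats, then drop blanks. It is an illustration only, not the authors' implementation; the function name `greedy_ctc_decode`, the four-symbol alphabet, and the per-frame probabilities are all made up for the example.

```python
from itertools import groupby

BLANK = "␣"  # blank symbol; the glyph is an illustrative choice


def greedy_ctc_decode(frame_probs, alphabet):
    """Greedy (best-path) CTC decoding for one sequence.

    frame_probs: one probability list per video frame, aligned with `alphabet`.
    alphabet:    list of output symbols, including the blank.
    """
    # Step 1: pick the most probable symbol independently at every time-step.
    best_path = [
        alphabet[max(range(len(probs)), key=probs.__getitem__)]
        for probs in frame_probs
    ]
    # Step 2: merge runs of identical consecutive symbols.
    merged = [symbol for symbol, _ in groupby(best_path)]
    # Step 3: remove blanks, leaving the decoded text.
    return "".join(symbol for symbol in merged if symbol != BLANK)


# Hypothetical toy input: 6 frames over a 4-symbol alphabet.
alphabet = [BLANK, "l", "a", "y"]
frame_probs = [
    [0.10, 0.70, 0.10, 0.10],  # l
    [0.10, 0.60, 0.20, 0.10],  # l (repeat, merged in step 2)
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.70, 0.10],  # a
    [0.10, 0.10, 0.60, 0.20],  # a (repeat)
    [0.10, 0.10, 0.10, 0.70],  # y
]
decoded = greedy_ctc_decode(frame_probs, alphabet)
print(decoded)  # lay
```

Because repeats are merged before blanks are removed, a blank between two identical symbols is what allows doubled letters to survive decoding: the path `l ␣ l` decodes to `ll`, whereas `l l` decodes to `l`.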

Paper Structure

This paper contains 20 sections, 3 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: LipNet architecture. A sequence of $T$ frames is used as input, and is processed by 3 layers of STCNN, each followed by a spatial max-pooling layer. The features extracted are processed by 2 Bi-GRUs; each time-step of the GRU output is processed by a linear layer and a softmax. This end-to-end model is trained with CTC.
  • Figure 2: Saliency maps for the words (a) please and (b) lay, produced by backpropagation to the input, showing the places where LipNet has learned to attend. The pictured transcription is given by greedy CTC decoding. CTC blanks are denoted by '␣'.
  • Figure 3: Intra-viseme and inter-viseme confusion matrices, depicting the three categories with the most confusions, as well as the confusions between viseme clusters. Colours are row-normalised to emphasise the errors.
  • Figure 4: LipNet's full phoneme confusion matrix.
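The row normalisation mentioned in the caption of Figure 3 can be sketched as follows, assuming it means dividing each row of a confusion-count matrix by that row's total, so that each row shows how one true class was distributed over the predicted classes. The helper name `row_normalise` and the counts are hypothetical and do not come from the paper.

```python
def row_normalise(confusions):
    """Divide each row by its sum so that every non-empty row sums to 1.

    confusions[i][j] is the number of times true class i was predicted as j.
    Rows with no observations are returned as all zeros.
    """
    normalised = []
    for row in confusions:
        total = sum(row)
        normalised.append([count / total if total else 0.0 for count in row])
    return normalised


# Hypothetical counts for two classes (not taken from the paper).
counts = [
    [8, 2],  # true class 0: 8 correct, 2 confused with class 1
    [1, 1],  # true class 1: rarely seen, but confused half the time
]
for row in row_normalise(counts):
    print(row)
```

In the toy counts, the second class has far fewer observations, yet after normalisation its 50% confusion rate stands out more than the first class's 20%, which is the effect the caption describes as emphasising the errors.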