VALLR: Visual ASR Language Model for Lip Reading

Marshall Thomas, Edward Fish, Richard Bowden

TL;DR

This paper tackles lip-reading as Visual Automatic Speech Recognition by introducing a phoneme-centric two-stage framework. It first learns a compact phoneme sequence from video using a ViT-based encoder with temporal downsampling and a CTC head, then reconstructs sentences with a fine-tuned Large Language Model (LLM) trained via LoRA on text data. The approach achieves state-of-the-art Word Error Rates on LRS3 (as low as 18.7%) and strong performance on LRS2, using only about 30 hours of labeled video data and without large-scale pretraining, highlighting data efficiency and interpretability due to an explicit intermediate phoneme representation. This modular design reduces error propagation across modalities and leverages linguistic context for sentence-level corrections, with practical implications for accessibility and privacy in noisy or restricted environments.

Abstract

Lip Reading, or Visual Automatic Speech Recognition (V-ASR), is a complex task requiring the interpretation of spoken language exclusively from visual cues, primarily lip movements and facial expressions. This task is especially challenging due to the absence of auditory information and the inherent ambiguity of visually distinguishing phonemes that share overlapping visemes, where different phonemes appear identical on the lips. Current methods typically attempt to predict words or characters directly from these visual cues, but this approach frequently encounters high error rates due to coarticulation effects and viseme ambiguity. We propose a novel two-stage, phoneme-centric framework for V-ASR that addresses these longstanding challenges. First, our model predicts a compact sequence of phonemes from visual inputs using a Video Transformer with a CTC head, thereby reducing the task complexity and achieving robust speaker invariance. This phoneme output then serves as the input to a fine-tuned Large Language Model (LLM), which reconstructs coherent words and sentences by leveraging broader linguistic context. Unlike existing methods that either predict words directly (often faltering on visually similar phonemes) or rely on large-scale multimodal pre-training, our approach explicitly encodes intermediate linguistic structure while remaining highly data efficient. We demonstrate state-of-the-art performance on two challenging datasets, LRS2 and LRS3, where our method achieves significant reductions in Word Error Rate (WER), reaching a state-of-the-art WER of 18.7% on LRS3 despite using 99.4% less labelled data than the next best approach.
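
To make the first stage more concrete, the sketch below shows how per-frame scores from a CTC head can be turned into a compact phoneme sequence. It uses greedy (best-path) decoding, which is one common choice and not necessarily the decoding used in the paper. The blank symbol, the ARPAbet-style phoneme labels, and the toy scores are all assumptions made for illustration.

```python
from itertools import groupby

BLANK = "<blank>"  # hypothetical name for the CTC blank symbol


def ctc_greedy_decode(frame_scores, vocab, blank=BLANK):
    """Greedy CTC decoding: best label per frame, merge repeats, drop blanks.

    frame_scores: one list of scores per frame, aligned with `vocab`.
    """
    # 1. Pick the highest-scoring label for every frame.
    best_path = [
        vocab[max(range(len(scores)), key=scores.__getitem__)]
        for scores in frame_scores
    ]
    # 2. Merge consecutive duplicates (the same phoneme spanning several frames).
    merged = [label for label, _ in groupby(best_path)]
    # 3. Remove blanks; a blank between two identical labels keeps them separate.
    return [label for label in merged if label != blank]


# Toy vocabulary and scores (illustrative values, not model output).
vocab = [BLANK, "HH", "AH", "L", "OW"]
frame_scores = [
    [0.10, 0.70, 0.10, 0.05, 0.05],  # HH
    [0.10, 0.60, 0.20, 0.05, 0.05],  # HH (repeat, merged)
    [0.80, 0.05, 0.05, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.60, 0.10, 0.10],  # AH
    [0.10, 0.05, 0.05, 0.70, 0.10],  # L
    [0.10, 0.05, 0.05, 0.60, 0.20],  # L (repeat, merged)
    [0.70, 0.10, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.05, 0.05, 0.10, 0.70],  # OW
]

phonemes = ctc_greedy_decode(frame_scores, vocab)
print(" ".join(phonemes))
```

The resulting phoneme list is much shorter than the frame sequence, which is what allows it to be passed as a compact text prompt to the second-stage language model.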

Paper Structure

This paper contains 22 sections, 9 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Comparison of different models' WER for Visual Automatic Speech Recognition on the LRS3 dataset [afouras2018lrs3], plotted against the amount of labelled training data. Circle size (green) denotes scale of pre-training data, while fine-tuned models (grey) are not pre-trained. Our model (orange) outperforms all existing approaches with just 30 hours of video training data, no self-supervised pre-training, and without the requirement for additional labeled visual data during fine-tuning.
  • Figure 2: An overview of our approach. First, we extract facial regions from 16 frames of input video. We apply random pixel masking at 50% probability, add positional embedding, and encode visual features via a vision transformer encoder (ViT-base 16×16). We implement temporal downsampling via 1D convolution and then use a CTC linear head to predict sequences of phonemes. During training, we also fine-tune an LLM to reconstruct sentences from phonemes with a text-only dataset CMU. During inference, the phonemes from the CTC head are processed via the LLM to reconstruct the predicted text. This can be performed end-to-end or in two stages, depending on available resources.
  • Figure 3: Example of the model's phonetic and sentence outputs from a sample in the LRS3 dataset [afouras2018lrs3]. The table illustrates the model's ability to predict a sequence of phonemes from visual input, which are then reconstructed into a coherent sentence by the LLM. In this example, all of the phonemes are predicted correctly and the words are recreated correctly.
  • Figure 4: Comparison of the model's phonetic and sentence outputs with ground truth from a sample in the LRS3 dataset [afouras2018lrs3]. Red shows an incorrect prediction, M shows a missing prediction, and green shows a correction from phonemes to words. In this example, even though the model incorrectly predicts certain phonemes, the LLM can correctly recreate the word but struggles to recreate homophones.
  • Figure 5: Confusion matrix showing the performance on isolated phonemes of the LRS3 dataset [afouras2018lrs3]. We observe a very high match rate between the predicted phonemes and the ground truth. In red, we show the most difficult phonemes for our model to identify.
  • ...and 1 more figures
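
The results summarised above are reported as Word Error Rate (WER). As a reminder of what that number measures, here is a minimal sketch using the standard definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name and the example sentences are hypothetical and are not taken from the paper or its evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# A homophone swap counts as one substitution, even though the phonemes match.
wer_homophone = word_error_rate("their house is red", "there house is red")
print(f"{wer_homophone:.2f}")
```

Under this metric, a homophone error like the one described for Figure 4 is penalised as a full word error even when every phoneme was predicted correctly.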