VALLR: Visual ASR Language Model for Lip Reading
Marshall Thomas, Edward Fish, Richard Bowden
TL;DR
This paper tackles lip-reading as Visual Automatic Speech Recognition by introducing a phoneme-centric two-stage framework. It first learns a compact phoneme sequence from video using a ViT-based encoder with temporal downsampling and a CTC head, then reconstructs sentences with a fine-tuned Large Language Model (LLM) trained via LoRA on text data. The approach achieves state-of-the-art Word Error Rates on LRS3 (as low as 18.7%) and strong performance on LRS2, using only about 30 hours of labeled video data and without large-scale pretraining, highlighting data efficiency and interpretability due to an explicit intermediate phoneme representation. This modular design reduces error propagation across modalities and leverages linguistic context for sentence-level corrections, with practical implications for accessibility and privacy in noisy or restricted environments.
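To make the first stage more concrete, the sketch below shows how per-frame outputs of a CTC head are commonly turned into a compact phoneme sequence by greedy decoding: take the best class per frame, merge consecutive repeats, then drop blanks. This is an illustration only, not the authors' implementation. The blank index, the tiny phoneme inventory, and the helper names `ctc_greedy_decode` and `one_hot` are all assumptions made for the example.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol

# Hypothetical, tiny phoneme inventory used only for this illustration.
ID_TO_PHONEME = {1: "HH", 2: "AH", 3: "L", 4: "OW"}


def ctc_greedy_decode(frame_scores, id_to_phoneme, blank=BLANK):
    """Greedy CTC decoding: argmax per frame, merge repeats, drop blanks."""
    # Best class index for every frame.
    best = [max(range(len(row)), key=row.__getitem__) for row in frame_scores]
    # Merge runs of identical consecutive predictions into one symbol.
    collapsed = [idx for idx, _ in groupby(best)]
    # Remove blanks and map the remaining indices to phoneme labels.
    return [id_to_phoneme[idx] for idx in collapsed if idx != blank]


def one_hot(index, size=5):
    """Build a fake score row that peaks at `index` (for the demo only)."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Nine video frames whose best classes are: HH HH _ AH AH L _ OW OW
frames = [one_hot(i) for i in [1, 1, 0, 2, 2, 3, 0, 4, 4]]
phonemes = ctc_greedy_decode(frames, ID_TO_PHONEME)
print(phonemes)
```

In this toy case nine frames collapse to four phonemes, which is the kind of short sequence the second stage would then receive as text input.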
Abstract
Lip Reading, or Visual Automatic Speech Recognition (V-ASR), is a complex task requiring the interpretation of spoken language exclusively from visual cues, primarily lip movements and facial expressions. This task is especially challenging due to the absence of auditory information and the inherent ambiguity of visemes: different phonemes can appear identical on the lips. Current methods typically attempt to predict words or characters directly from these visual cues, but this approach frequently encounters high error rates due to coarticulation effects and viseme ambiguity. We propose a novel two-stage, phoneme-centric framework for V-ASR that addresses these longstanding challenges. First, our model predicts a compact sequence of phonemes from visual inputs using a Video Transformer with a CTC head, thereby reducing the task complexity and achieving robust speaker invariance. This phoneme output then serves as the input to a fine-tuned Large Language Model (LLM), which reconstructs coherent words and sentences by leveraging broader linguistic context. Unlike existing methods that either predict words directly (often faltering on visually similar phonemes) or rely on large-scale multimodal pre-training, our approach explicitly encodes intermediate linguistic structure while remaining highly data efficient. We demonstrate state-of-the-art performance on two challenging datasets, LRS2 and LRS3, with significant reductions in Word Error Rate (WER): our method reaches a WER of 18.7% on LRS3 despite using 99.4% less labelled data than the next best approach.
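For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between the reference and the predicted sentence (substitutions, deletions, and insertions), divided by the number of reference words. The sketch below is a minimal implementation of that standard definition. It is not the authors' evaluation code: the function name `word_error_rate` and whitespace tokenisation are assumptions, and published results typically also apply text normalisation that is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("a b c d", "a x c d"))  # 0.25
```

Under this definition, a WER of 18.7% corresponds to roughly one word-level error for every five to six reference words.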
