From Hype to Insight: Rethinking Large Language Model Integration in Visual Speech Recognition
Rishabh Jain, Naomi Harte
TL;DR
This study critically assesses whether gains from integrating large language model (LLM) decoders with visual speech encoders in visual speech recognition (VSR) stem from enhanced visual understanding or from stronger language modeling. By systematically varying decoder size, adaptation strategies, and training data while freezing or only selectively updating the visual encoder, and evaluating on LRS2, LRS3, and WildVSR, it shows that improvements are largely due to contextual reasoning in the language model rather than better visual features. Across the ablations, decoder scaling and adaptation strategy yield limited gains, while training on the combined, more diverse dataset gives the strongest performance; multiple lexical and semantic metrics show only modest benefits from the decoder. The findings suggest that advancing visual encoders, through large-scale self-supervised learning and Transformer-based visual front-ends, will be necessary for meaningful progress in VSR, since decoders mainly refine linguistic context rather than visually grounded lip-reading features.
Abstract
Advances in self-supervised encoders have improved Visual Speech Recognition (VSR). Recent approaches that integrate these encoders with large language model (LLM) decoders improve transcription accuracy; however, it remains unclear whether these gains stem from visual understanding or from stronger language modeling. In this work, we systematically evaluate LLM decoders by freezing or selectively updating the visual encoder, scaling decoder size, comparing adaptation strategies and architectures, and varying training data across LRS2, LRS3, and their combination. Evaluation on LRS2, LRS3, and WildVSR shows that scaling and adaptation yield limited improvements, while combining datasets enhances generalization. Semantic analysis reveals that gains arise primarily from lexical rather than semantic processing. Our Llama-2-13B model trained on the combined set achieves a word error rate (WER) of 24.7% on LRS3 and 47.0% on WildVSR, establishing the state of the art among models trained without additional supervision. Our findings indicate that LLM decoders refine contextual reasoning rather than visual features, emphasizing the need for stronger visual encoders to drive meaningful progress.
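For readers unfamiliar with the metric behind the reported numbers, the following is a minimal sketch of word error rate: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the lowercase, whitespace-split normalization are illustrative assumptions; the text normalization and scoring tooling actually used in this work are not specified here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: lowercase + whitespace tokenization; real evaluations may
    also strip punctuation or apply other normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn ref[:i] into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many insertions. Corpus-level scores are typically computed by summing errors and reference words over all utterances rather than averaging per-utterance rates.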
