ViSpeR: Multilingual Audio-Visual Speech Recognition
Sanath Narayan, Yasser Abdelaziz Dahou Djilali, Ankit Singh, Eustache Le Bihan, Hakim Hacid
TL;DR
ViSpeR tackles the scarcity of multilingual visual speech datasets by building a large-scale data pipeline and releasing a multilingual AVSR/VSR corpus across Chinese, Arabic, Spanish, and French (with English evaluation via LRS3/WildVSR). The pipeline combines YouTube keyword searches, a lightweight preselection classifier, scene-change detection, YOLOv5-face tracking, SyncNet active-speaker filtering, and Whisper-based transcription to produce language-tagged audio-visual clips segmented to 2–16 seconds, totaling over 3600 hours and roughly 3.2 million clips. Multilingual encoder–decoder baselines modeled after AutoAVSR demonstrate AVSR advantages over VSR across languages, though non-Latin languages remain more challenging due to accents and transcription noise. The work provides benchmarks for non-English VSR, shows feasibility of multilingual training, and outlines future directions including self-supervised multilingual VSR, translation across languages, and broader AVSR applications.
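The 2–16 second segmentation step mentioned above can be illustrated with a small sketch. It is not the authors' implementation: the summary does not say how clip boundaries are chosen, so the code assumes timestamped transcript segments (start, end, text) are already available and greedily merges neighbours. `Segment`, `merge_into_clips`, and the `MAX_GAP_S` threshold are hypothetical names and values introduced only for illustration. Only the 2 and 16 second bounds come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_CLIP_S = 2.0   # lower duration bound stated in the TL;DR
MAX_CLIP_S = 16.0  # upper duration bound stated in the TL;DR
MAX_GAP_S = 0.5    # hypothetical: longest pause tolerated inside one clip


@dataclass
class Segment:
    start: float  # seconds from the beginning of the video
    end: float    # seconds from the beginning of the video
    text: str     # transcript for this time span


def merge_into_clips(segments: List[Segment]) -> List[Segment]:
    """Greedily merge consecutive segments, then keep clips of 2-16 seconds."""
    clips: List[Segment] = []
    current: Optional[Segment] = None
    for seg in sorted(segments, key=lambda s: s.start):
        if current is not None:
            fits = seg.end - current.start <= MAX_CLIP_S
            close = seg.start - current.end <= MAX_GAP_S
            if fits and close:
                # Extend the running clip with this segment.
                current = Segment(current.start, seg.end,
                                  f"{current.text} {seg.text}")
                continue
            # Otherwise close the running clip and start a new one.
            clips.append(current)
        current = seg
    if current is not None:
        clips.append(current)
    # Discard clips whose duration falls outside the target window.
    return [c for c in clips if MIN_CLIP_S <= c.end - c.start <= MAX_CLIP_S]


if __name__ == "__main__":
    demo = [
        Segment(0.0, 1.5, "a"),
        Segment(1.6, 3.0, "b"),    # merges with the previous segment
        Segment(10.0, 11.0, "c"),  # too short on its own, dropped
        Segment(20.0, 40.0, "d"),  # too long, dropped
    ]
    for clip in merge_into_clips(demo):
        print(clip)
```

In this sketch, over-long segments are simply dropped rather than split. A real pipeline would more likely cut them at word or pause boundaries, which the summary does not detail.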
Abstract
This work presents an extensive study of Audio-Visual Speech Recognition (AVSR) for five widely spoken languages: Chinese, Spanish, English, Arabic, and French. We collect large-scale datasets for every language except English and train supervised models on them. Our model, ViSpeR, is trained in a multilingual setting and achieves competitive performance on newly established benchmarks for each language. The datasets and models are released to the community as a foundation for further research on AVSR, an increasingly important area. Code: https://github.com/YasserdahouML/visper
