Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper

Jeong Hun Yeo, Minsu Kim, Shinji Watanabe, Yong Man Ro

TL;DR

This work tackles Visual Speech Recognition (VSR) for languages with scarce labeled data by automatically labeling unlabeled multilingual video with Whisper, adding 1,002 hours of training data across French, Italian, Spanish, and Portuguese. The approach uses Whisper for both language identification and transcription, and combines the new auto-labeled data with a modest amount of human-labeled data to train a VSR model that outperforms the prior state of the art on the mTEDx benchmark. The results show that automatic labels can approach the performance of human labels and, in some cases, exceed it, underscoring the potential of data-centric strategies for multilingual VSR. The study also provides ablation analyses showing that performance improves as the amount of auto-labeled data grows, and it releases the automatic labels for broader use.

Abstract

This paper proposes a powerful Visual Speech Recognition (VSR) method for multiple languages, especially low-resource languages that have only a limited amount of labeled data. Unlike previous methods, which tried to improve VSR performance for the target language by using knowledge learned from other languages, we explore whether we can increase the amount of training data itself for different languages without human intervention. To this end, we employ a Whisper model, which can perform both language identification and audio-based speech recognition. It serves to filter data of the desired languages and to transcribe labels from an unannotated, multilingual audio-visual data pool. By comparing VSR models trained on automatic labels with models trained on human-annotated labels, we show that we can achieve VSR performance similar to that obtained with human-annotated labels, even without using any human annotations. Through the automated labeling process, we label the large-scale unlabeled multilingual databases VoxCeleb2 and AVSpeech, producing 1,002 hours of data for four low-resource VSR languages: French, Italian, Spanish, and Portuguese. With the automatic labels, we achieve new state-of-the-art performance on mTEDx in all four languages, significantly surpassing previous methods. The automatic labels are available online: https://github.com/JeongHun0716/Visual-Speech-Recognition-for-Low-Resource-Languages
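
The filter-then-transcribe step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' released pipeline: the names `auto_label`, `AutoLabel`, `Recognizer`, and `make_whisper_recognizer` are hypothetical, the `"base"` model size is an arbitrary placeholder (the abstract does not say which Whisper model is used), and the adapter assumes the open-source `openai-whisper` package is installed. The labeling logic is kept separate from Whisper so that it can be run with any function mapping an audio path to a language code and a transcript.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set, Tuple

# Language codes for the four target languages named in the abstract.
TARGET_LANGUAGES: Set[str] = {"fr", "it", "es", "pt"}

# A recognizer maps an audio path to (language_code, transcript).
Recognizer = Callable[[str], Tuple[str, str]]


@dataclass
class AutoLabel:
    clip_path: str
    language: str
    transcript: str


def auto_label(clips: Iterable[str], recognize: Recognizer,
               targets: Set[str] = TARGET_LANGUAGES) -> List[AutoLabel]:
    """Keep clips whose detected language is a target and attach the transcript."""
    labels = []
    for path in clips:
        language, text = recognize(path)
        text = text.strip()
        # Drop clips in other languages and clips with an empty transcript.
        if language in targets and text:
            labels.append(AutoLabel(path, language, text))
    return labels


def make_whisper_recognizer(model_name: str = "base") -> Recognizer:
    """Hypothetical adapter around openai-whisper (assumed to be installed)."""
    import whisper  # imported lazily so the sketch above runs without it

    model = whisper.load_model(model_name)

    def recognize(path: str) -> Tuple[str, str]:
        # With no language argument, transcribe() detects the language itself
        # and reports it alongside the recognized text.
        result = model.transcribe(path)
        return result["language"], result["text"]

    return recognize
```

With Whisper available, `auto_label(paths, make_whisper_recognizer())` would return one `AutoLabel` per retained clip; the resulting transcripts could then serve as training targets for the video stream of the same clips.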

Paper Structure (16 sections, 1 figure, 5 tables)