Transcription and translation of videos using fine-tuned XLSR Wav2Vec2 on custom dataset and mBART
Aniket Tathe, Anand Kamble, Suyash Kumbharkar, Atharva Bhandare, Anirban C. Mitra
TL;DR
The paper addresses the problem of building an automatic speech recognition (ASR) system for a personalized Hindi voice with extremely limited data. It introduces a pipeline that creates a bespoke Hindi dataset via Retrieval-Based Voice Conversion (RVC) from just 14 minutes of audio, fine-tunes XLSR Wav2Vec2 on it for Hindi transcription, and uses mBART for Hindi-to-English translation. The components are integrated into a Gradio GUI that outputs English subtitles aligned with the video. The approach demonstrates that data augmentation and cross-lingual self-supervised representations can enable reasonably accurate transcription and translation of personalized speech in a low-resource language. The end-to-end system, which includes speaker diarization and subtitle alignment, offers a practical solution for multilingual video content featuring personalized voices and could be extended to other low-resource languages and domains.
Abstract
This research addresses the challenge of training an automatic speech recognition (ASR) model for personalized voices with minimal data. Using just 14 minutes of custom audio from a YouTube video, we employ Retrieval-Based Voice Conversion (RVC) to create a custom version of the Common Voice 16.0 corpus. A Cross-lingual Self-supervised Representations (XLSR) Wav2Vec2 model is then fine-tuned on this dataset. The developed web-based GUI efficiently transcribes and translates input Hindi videos. By integrating XLSR Wav2Vec2 and mBART, the system aligns the translated text with the video timeline, delivering an accessible solution for transcribing and translating multilingual video content with personalized voices.
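
The final step of the pipeline, aligning translated text with the video timeline, can be illustrated with a small sketch. It assumes that the upstream transcription and translation stages already yield English text with start and end times in seconds, and that subtitles are written in the SubRip (SRT) format. Neither assumption is stated in the abstract, and the names `Segment`, `srt_timestamp`, and `to_srt` are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One translated utterance on the video timeline (hypothetical type)."""
    start: float  # start time in seconds (assumed to come from the ASR stage)
    end: float    # end time in seconds
    text: str     # English translation of the Hindi transcript


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render segments as numbered SRT cues, ordered by start time."""
    cues = []
    ordered = sorted(segments, key=lambda s: s.start)
    for index, seg in enumerate(ordered, start=1):
        timing = f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}"
        cues.append(f"{index}\n{timing}\n{seg.text.strip()}\n")
    # A blank line separates consecutive cues.
    return "\n".join(cues)
```

Calling `to_srt([Segment(0.0, 2.5, "Hello"), Segment(2.5, 4.0, "How are you?")])` yields two numbered cues that a video player can overlay on the original video. In a full system the segment boundaries would have to come from the speech recognition or diarization stage, which this sketch does not model.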
