
End to end Hindi to English speech conversion using Bark, mBART and a finetuned XLSR Wav2Vec2

Aniket Tathe, Anand Kamble, Suyash Kumbharkar, Atharva Bhandare, Anirban C. Mitra

TL;DR

The paper addresses the challenge of end-to-end Hindi-to-English speech translation to produce English audio. It presents an integrated pipeline combining fine-tuned XLSR Wav2Vec2 for Hindi ASR, mBART for translation, and Bark for English TTS. It discusses dataset choices based on Common Voice Hindi, detailing fine-tuning settings and WER results, and highlights the system's end-to-end flow and practical applications such as portable translation devices. The work advances cross-lingual speech interfaces and motivates further optimization for real-time use.

Abstract

Speech has long been a barrier to effective communication and connection, persisting as a challenge in our increasingly interconnected world. This research paper introduces a transformative solution to this persistent obstacle: an end-to-end speech conversion framework tailored for Hindi-to-English translation, culminating in the synthesis of English audio. By integrating cutting-edge technologies such as XLSR Wav2Vec2 for automatic speech recognition (ASR), mBART for neural machine translation (NMT), and a Text-to-Speech (TTS) synthesis component, this framework offers a unified and seamless approach to cross-lingual communication. We delve into the intricate details of each component, elucidating their individual contributions and exploring the synergies that enable a fluid transition from spoken Hindi to synthesized English audio.
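
The three-stage flow described in the abstract (ASR, then NMT, then TTS) can be illustrated with a minimal sketch. This is not the authors' code: the class name `SpeechToSpeechPipeline`, the `Audio` alias, and the three stub functions are hypothetical placeholders. In the paper's system the stages would be backed by the fine-tuned XLSR Wav2Vec2 model, mBART, and Bark respectively; here they are stand-ins so the composition itself can be run without downloading any model.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical alias: a mono waveform as a list of float samples.
Audio = List[float]


@dataclass
class SpeechToSpeechPipeline:
    """Chains three stages: Hindi audio -> Hindi text -> English text -> English audio."""

    transcribe: Callable[[Audio], str]   # ASR stage (XLSR Wav2Vec2 in the paper)
    translate: Callable[[str], str]      # NMT stage (mBART in the paper)
    synthesize: Callable[[str], Audio]   # TTS stage (Bark in the paper)

    def __call__(self, hindi_audio: Audio) -> Audio:
        hindi_text = self.transcribe(hindi_audio)
        english_text = self.translate(hindi_text)
        return self.synthesize(english_text)


# Placeholder stages (assumptions for illustration only, not real models).
def stub_asr(audio: Audio) -> str:
    return "नमस्ते"


def stub_nmt(text: str) -> str:
    return {"नमस्ते": "hello"}.get(text, text)


def stub_tts(text: str) -> Audio:
    # One silent sample per character, just to produce an audio-shaped output.
    return [0.0] * len(text)


pipeline = SpeechToSpeechPipeline(stub_asr, stub_nmt, stub_tts)
english_audio = pipeline([0.1, -0.2, 0.05])
```

Keeping each stage behind a plain callable means any one model could be swapped or benchmarked separately without touching the other two, which is one plausible way to structure such a cascade.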

Paper Structure (8 sections, 2 figures)

This paper contains 8 sections, 2 figures.

Figures (2)

  • Figure 1: WER (Word error rate)
  • Figure 2: Hindi to English speech conversion
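
Figure 1 reports word error rate (WER), the standard ASR metric: the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that standard definition; the function name `word_error_rate` is hypothetical, whitespace tokenization is an assumption, and the paper's own evaluation tooling may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (Levenshtein distance, row by row).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.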