Custom Data Augmentation for low resource ASR using Bark and Retrieval-Based Voice Conversion
Anand Kamble, Aniket Tathe, Suyash Kumbharkar, Atharva Bhandare, Anirban C. Mitra
TL;DR
The paper addresses the shortage of high-quality ASR training data for low-resource languages such as Hindi by proposing two complementary data augmentation pipelines. The first pipeline uses Bark, combined with Meta's EnCodec and a HuBERT-based semantic alignment step, to generate audio-codebook representations for customized datasets. The second pipeline uses Retrieval-Based Voice Conversion (RVC), with the Ozen toolkit handling data preparation, including diarization and Hindi transcription, before personalized voice data is synthesized. Together, these approaches show practical ways to create tailored Common Voice-style datasets and to support personalized voice generation: Bark offers flexible semantic-token generation, while RVC delivers low-noise, near-target voice outputs suitable for augmentation in low-resource ASR systems.
Abstract
This paper proposes two methodologies for constructing customized Common Voice datasets for low-resource languages such as Hindi. The first methodology leverages Bark, a transformer-based text-to-audio model developed by Suno, and incorporates Meta's EnCodec and a pre-trained HuBERT model to enhance Bark's performance. The second methodology employs Retrieval-Based Voice Conversion (RVC) and uses the Ozen toolkit for data preparation. Both methodologies contribute to the advancement of ASR technology and offer insights into the challenges of constructing customized Common Voice datasets for under-resourced languages. Furthermore, they provide a pathway to high-quality, personalized voice generation for a range of applications.
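To make the idea of a customized Common Voice-style dataset concrete, the following is a minimal sketch of how synthesized clips might be paired with their transcripts in a tab-separated manifest. It is an illustration under stated assumptions, not the authors' pipeline: `build_manifest`, `write_silence`, the `train.tsv` file name, the WAV clip format, and the three-column subset of fields are all hypothetical choices made here. The `synthesize` callable stands in for whichever generator is used (Bark or RVC) and only writes silence in this sketch.

```python
import csv
import wave
from pathlib import Path
from typing import Callable, Iterable

# Assumption: a small subset of Common Voice-style columns, chosen for illustration.
FIELDS = ["client_id", "path", "sentence"]


def write_silence(text: str, path: Path, seconds: float = 0.1, rate: int = 16000) -> None:
    """Hypothetical stand-in for a TTS or voice-conversion call.

    Ignores `text` and writes silent 16-bit mono PCM so the sketch runs
    without any model. A real pipeline would synthesize speech here.
    """
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))


def build_manifest(
    sentences: Iterable[str],
    out_dir: str,
    speaker_id: str,
    synthesize: Callable[[str, Path], None] = write_silence,
) -> Path:
    """Write one clip per sentence and a TSV manifest linking clips to text."""
    root = Path(out_dir)
    clips = root / "clips"
    clips.mkdir(parents=True, exist_ok=True)
    manifest = root / "train.tsv"
    with manifest.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS, delimiter="\t")
        writer.writeheader()
        for index, sentence in enumerate(sentences):
            name = f"{speaker_id}_{index:06d}.wav"
            synthesize(sentence, clips / name)
            writer.writerow({"client_id": speaker_id, "path": name, "sentence": sentence})
    return manifest
```

Passing a different `synthesize` callable is the only change needed to swap the placeholder for a real generator, which keeps the manifest logic independent of the audio model.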
