VoxVietnam: a Large-Scale Multi-Genre Dataset for Vietnamese Speaker Recognition
Hoang Long Vu, Phuong Tuan Dat, Pham Thao Nhi, Nguyen Song Hao, Nguyen Thi Thu Trang
TL;DR
The paper addresses multi-genre variability in Vietnamese speaker recognition by introducing VoxVietnam, the first large-scale, genre-diverse dataset for the task. It proposes a scalable, language-agnostic data-construction pipeline that automatically crawls public playlists, segments audio, clusters speakers, cleans labels with visual cues, merges speakers, and classifies utterance genres. VoxVietnam contains 187,980 utterances from 1,406 speakers, totalling 261 hours across three genres (spontaneous, reading, singing). Experiments show that models trained on a single genre degrade on multi-genre tests, whereas multi-genre training improves performance; the visual-aid cleansing step further improves data quality. The dataset is publicly available and is shown to improve robustness in real-world scenarios, enabling further research into multi-genre effects in Vietnamese speaker recognition.
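To give a feel for the scale quoted above, the following minimal sketch derives a few per-utterance and per-speaker averages from the three headline figures. The function name `dataset_averages` and the dictionary keys are illustrative choices, not part of the paper, and because the 261-hour total is a rounded figure the results are approximate.

```python
# Headline figures as quoted in the summary above.
NUM_UTTERANCES = 187_980
NUM_SPEAKERS = 1_406
TOTAL_HOURS = 261  # rounded total, so the derived averages are approximate


def dataset_averages(num_utterances, num_speakers, total_hours):
    """Return rough averages implied by the headline statistics.

    Hypothetical helper for illustration only; it assumes nothing about
    how utterances are actually distributed across speakers or genres.
    """
    total_seconds = total_hours * 3600
    return {
        "seconds_per_utterance": total_seconds / num_utterances,
        "utterances_per_speaker": num_utterances / num_speakers,
        "minutes_per_speaker": total_hours * 60 / num_speakers,
    }


averages = dataset_averages(NUM_UTTERANCES, NUM_SPEAKERS, TOTAL_HOURS)
for name, value in averages.items():
    print(f"{name}: {value:.1f}")
```

Under these figures, an utterance lasts about 5 seconds on average, and each speaker contributes roughly 134 utterances, or about 11 minutes of audio. These are means only; the summary does not state how evenly the data is spread across speakers or genres.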
Abstract
Recent research in speaker recognition aims to address vulnerabilities due to variations between enrolment and test utterances, particularly the multi-genre phenomenon, in which the utterances are in different speech genres. Previous resources for Vietnamese speaker recognition are either limited in size or do not focus on genre diversity, leaving multi-genre effects unexplored. This paper introduces VoxVietnam, the first multi-genre dataset for Vietnamese speaker recognition, with over 187,000 utterances from 1,406 speakers, together with an automated pipeline for constructing such a dataset at large scale from public sources. Our experiments show the challenges that the multi-genre phenomenon poses to models trained on a single-genre dataset and demonstrate a significant increase in performance when VoxVietnam is incorporated into training.
