RoboVox Far Field Speaker Recognition: A Novel Data Augmentation Approach with Pretrained Models

Muhammad Sudipto Siam Dip; Md Anik Hasan; Sapnil Sarker Bipro; Md Abdur Raiyan; Mohammod Abdul Motin

TL;DR

Adding noise to enrollment files efficiently aligns the sources of the test and enrollment files, which improves their comparability and contributes to the success of RoboVox far-field speaker recognition in this paper.

Abstract

In this study, we address the challenge of speaker recognition with a novel data augmentation technique: adding noise to enrollment files. This technique efficiently aligns the sources of test and enrollment files, enhancing comparability. Various pre-trained models were employed, with the ResNet model performing best at a DCF of 0.84 and an EER of 13.44. The augmentation technique improved these results to 0.75 DCF and 12.79 EER for the ResNet model. Comparative analysis showed that ResNet outperformed models such as ECAPA, Mel-spectrogram, Pyannote, and TitaNet-Large. These results, along with different augmentation schemes, contribute to the success of RoboVox far-field speaker recognition in this paper.

Paper Structure (10 sections, 2 figures, 2 tables)

Introduction
Methodology
Dataset Description
Preprocessing
Noise Reduction
Data augmentation with noise samples
Extracting the embedding
Result
Discussion
Conclusion

Figures (2)

Figure 1: An overall representation of our implemented framework. It starts by taking audio files from channel 5 and mixing them with noise extracted from audio files recorded on channel 4. The augmented enrollment signal is used to compute the set of embedding vectors with a large deep-learning model. In parallel, the audio files from the test set are fetched and their embeddings are computed. The two sets of embeddings are then compared using the cosine dissimilarity metric.
Figure 2: The step-by-step output of the implemented noise extraction algorithm: a) an audio file recorded with microphone 4; b) the binary mask obtained by determining voice activity intervals; c) the result of multiplying the binary mask with the audio signal; d) the retained non-zero segments, i.e. noise.
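
The pipeline described in the two captions can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the energy-threshold voice activity detector, the function names, and the parameters frame_len, threshold_ratio, and snr_db are all hypothetical, since the captions do not say which detector or mixing level was used. The pretrained embedding model is not shown; cosine_dissimilarity simply takes two vectors that such a model would produce.

```python
import numpy as np

def non_speech_mask(signal, frame_len=400, threshold_ratio=0.1):
    """Per-sample binary mask: 1.0 where a frame looks like non-speech, else 0.0.

    Assumption: a simple frame-energy threshold stands in for the VAD.
    """
    mask = np.zeros(len(signal))
    n_frames = len(signal) // frame_len
    if n_frames == 0:
        return mask
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)
    quiet = energy < threshold_ratio * energy.max()
    mask[: n_frames * frame_len] = np.repeat(quiet, frame_len)
    return mask

def extract_noise(signal, **vad_kwargs):
    """Steps b) to d) of Figure 2: mask the signal, keep the unmasked part."""
    mask = non_speech_mask(signal, **vad_kwargs)
    masked = signal * mask          # speech regions become zero
    # Select by mask rather than by value so true zero samples are kept.
    return masked[mask > 0]

def mix_noise(clean, noise, snr_db=10.0):
    """Add noise to an enrollment signal at a target SNR (hypothetical knob)."""
    if len(noise) == 0:
        return clean.copy()
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]   # loop noise to full length
    noise_power = np.mean(noise ** 2)
    if noise_power == 0:
        return clean.copy()
    clean_power = np.mean(clean ** 2)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

def cosine_dissimilarity(a, b):
    """1 - cosine similarity between two embedding vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

In use, noise would be extracted from a channel 4 recording, mixed into each channel 5 enrollment file, and the embeddings of the augmented enrollment and test files would then be scored with cosine_dissimilarity, where a lower score indicates a closer match.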
