Memory-Efficient Training for Text-Dependent SV with Independent Pre-trained Models

Seyed Ali Farokh, Hossein Zeinali

TL;DR

The paper addresses the high memory cost of training text-dependent speaker verification (TdSV) systems by decoupling phrase verification from speaker verification and using independently pre-trained models with targeted domain adaptation. A phrase verification classifier based on XLSR, paired with a speaker verification path built on ResNet or Whisper-PMFA, achieves competitive MinDCF and EER with significantly lower GPU requirements. The system uses two-stage training and a cosine scoring backend with AS-Norm. Key contributions are a bilingual phrase-adapted XLSR system, several memory-efficient SV models, and a fusion of these systems that yields strong evaluation results on TdSV Challenge Task 1, particularly in resource-constrained settings. The approach lowers training memory and compute costs while maintaining verification performance, and it placed first in the Iranian division of the challenge.
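
To make the scoring backend concrete, here is a minimal sketch of cosine scoring with adaptive score normalization (AS-Norm). It is not the authors' implementation. It assumes the commonly used AS-Norm formulation, in which the raw cosine score is normalized against the mean and standard deviation of the top-K cohort scores of the enrollment embedding and of the test embedding, and the two normalized scores are averaged. The function names, the toy two-dimensional embeddings, and the default `top_k=3` are hypothetical choices for illustration; this summary does not state the cohort or the K value used in the paper.

```python
import math
from statistics import mean, pstdev


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_stats(embedding, cohort, top_k):
    """Mean and population std of the top_k highest cohort scores."""
    scores = sorted((cosine(embedding, c) for c in cohort), reverse=True)
    selected = scores[:top_k]
    return mean(selected), pstdev(selected)


def as_norm_score(enroll, test, cohort, top_k=3, eps=1e-8):
    """Symmetric AS-Norm of the cosine score between enroll and test.

    top_k=3 is an illustrative assumption; it only suits this toy cohort.
    eps guards against a zero standard deviation.
    """
    raw = cosine(enroll, test)
    mu_e, sd_e = top_k_stats(enroll, cohort, top_k)
    mu_t, sd_t = top_k_stats(test, cohort, top_k)
    return 0.5 * ((raw - mu_e) / (sd_e + eps) + (raw - mu_t) / (sd_t + eps))


# Toy data (hypothetical): real embeddings have hundreds of dimensions.
cohort = [[0.6, 0.4], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.3], [0.2, -1.0]]
enroll = [1.0, 0.1]
same_speaker = [1.0, 0.2]
other_speaker = [0.1, 1.0]

target_score = as_norm_score(enroll, same_speaker, cohort)
nontarget_score = as_norm_score(enroll, other_speaker, cohort)
```

In this toy setup the target trial receives a higher normalized score than the non-target trial. The normalization shifts and scales each trial by how the two embeddings behave against their closest cohort members, which is why AS-Norm tends to make a single decision threshold more stable across speakers.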

Abstract

This paper presents our submission to the Iranian division of the Text-Dependent Speaker Verification Challenge (TdSV) 2024. Conventional TdSV approaches typically jointly model speaker and linguistic features, requiring unsegmented inputs during training and incurring high computational costs. Additionally, these methods often fine-tune large-scale pre-trained speaker embedding models on the target domain dataset, which may compromise the pre-trained models' original ability to capture speaker-specific characteristics. To overcome these limitations, we employ a TdSV system that utilizes two pre-trained models independently and demonstrate that, by leveraging pre-trained models with targeted domain adaptation, competitive results can be achieved while avoiding the substantial computational costs associated with joint fine-tuning on unsegmented inputs in conventional approaches. Our best system reached a MinDCF of 0.0358 on the evaluation subset and secured first place in the challenge.

Paper Structure

This paper contains 14 sections, 3 equations, 1 figure, and 5 tables.

Figures (1)

  • Figure 1: DET curves of our best-performing system.