Language-Invariant Multilingual Speaker Verification for the TidyVoice 2026 Challenge

Ze Li; Xiaoxiao Miao; Juan Liu; Ming Li

TL;DR

Experimental results demonstrate that fine-tuning the large-scale pretrained model yields competitive performance, while language-adversarial training further enhances robustness and synthetic speech augmentation provides additional gains under limited training data conditions.

Abstract

Multilingual speaker verification (SV) remains challenging due to limited cross-lingual data and language-dependent information in speaker embeddings. This paper presents a language-invariant multilingual SV system for the TidyVoice 2026 Challenge. We adopt the multilingual self-supervised w2v-BERT 2.0 model as the backbone, enhanced with Layer Adapters and Multi-scale Feature Aggregation to better exploit multi-layer representations. A language-adversarial training strategy with a Gradient Reversal Layer is applied to promote language-invariant speaker embeddings. Moreover, a multilingual zero-shot text-to-speech system is used to synthesize speech in multiple languages, improving language diversity. Experimental results demonstrate that fine-tuning the large-scale pretrained model yields competitive performance, while language-adversarial training further enhances robustness. Synthetic speech augmentation provides further gains when training data is limited. Source code is available at https://github.com/ZXHY-82/LI-MSV-TidyVoice2026.

Paper Structure (15 sections, 8 equations, 3 figures, 1 table)

Introduction
Methods
Fine-tuning of the w2v-BERT 2.0 Pre-trained Model
Language-Invariant Speaker Representation Learning
Multilingual Synthetic Speech Data Augmentation
Experimental Setup
Datasets
Multilingual Synthetic Speech Generation
Training Details
Large-Scale Speaker Model Pre-training
Fine-tuning on TidyVoiceX Training Set with Language-Invariant Learning
Score Calibration
Results
Conclusion
Generative AI Use Disclosure

Figures (3)

Figure 1: Overview of the w2v-BERT 2.0-based speaker verification system with language-invariant learning
Figure 2: Speech Synthesis Pipeline
Figure 3: t-SNE Visualization of Real and Synthetic Speech Embeddings for Speaker id011337
