Language-Invariant Multilingual Speaker Verification for the TidyVoice 2026 Challenge

Ze Li, Xiaoxiao Miao, Juan Liu, Ming Li

TL;DR

Experimental results demonstrate that fine-tuning the large-scale pretrained model yields competitive performance, while language-adversarial training further enhances robustness and synthetic speech augmentation provides additional gains under limited training data conditions.

Abstract

Multilingual speaker verification (SV) remains challenging due to limited cross-lingual data and language-dependent information in speaker embeddings. This paper presents a language-invariant multilingual SV system for the TidyVoice 2026 Challenge. We adopt the multilingual self-supervised w2v-BERT 2.0 model as the backbone, enhanced with Layer Adapters and Multi-scale Feature Aggregation to better exploit multi-layer representations. A language-adversarial training strategy with a Gradient Reversal Layer is applied to promote language-invariant speaker embeddings. Moreover, a multilingual zero-shot text-to-speech system is used to synthesize speech in multiple languages, improving language diversity. Experimental results demonstrate that fine-tuning the large-scale pretrained model yields competitive performance, while language-adversarial training further enhances robustness. In addition, synthetic speech augmentation provides additional gains under limited training data conditions. Source code is available at https://github.com/ZXHY-82/LI-MSV-TidyVoice2026.
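
For readers unfamiliar with the Gradient Reversal Layer named in the abstract, the block below sketches its commonly used formulation in language-adversarial training. The notation is an illustrative assumption and is not taken from the paper's own equations: $\theta_f$ denotes the shared encoder parameters, $\theta_s$ the speaker classifier, $\theta_l$ the language classifier, $\lambda$ the reversal weight, and $\eta$ the learning rate.

```latex
% Assumed notation (hypothetical symbols, not the paper's).
% The layer is the identity in the forward pass and multiplies
% the incoming gradient by -lambda in the backward pass.
\begin{align}
  R_\lambda(\mathbf{x}) &= \mathbf{x}, &
  \frac{\partial R_\lambda}{\partial \mathbf{x}} &= -\lambda \mathbf{I}
\end{align}

% Resulting gradient-descent updates when the language classifier
% is attached to the encoder output through R_lambda.
\begin{align}
  \theta_s &\leftarrow \theta_s - \eta \,
      \frac{\partial \mathcal{L}_{\mathrm{spk}}}{\partial \theta_s} \\
  \theta_l &\leftarrow \theta_l - \eta \,
      \frac{\partial \mathcal{L}_{\mathrm{lang}}}{\partial \theta_l} \\
  \theta_f &\leftarrow \theta_f - \eta \left(
      \frac{\partial \mathcal{L}_{\mathrm{spk}}}{\partial \theta_f}
      - \lambda \,
      \frac{\partial \mathcal{L}_{\mathrm{lang}}}{\partial \theta_f}
    \right)
\end{align}
```

Under this sketch, the language classifier learns to predict the language, while the sign flip pushes the encoder to make that prediction harder. This is the sense in which the speaker embeddings are encouraged to be language-invariant. Whether the submitted system keeps $\lambda$ fixed or schedules it during training is not stated in this section.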

Paper Structure (15 sections, 8 equations, 3 figures, 1 table)

Figures (3)

  • Figure 1: Overview of the w2v-BERT 2.0-based speaker verification system with language-invariant learning
  • Figure 2: Speech Synthesis Pipeline
  • Figure 3: t-SNE Visualization of Real and Synthetic Speech Embeddings for Speaker id011337