Quantifying Cross-Lingual Transfer in Paralinguistic Speech Tasks

Pol Buitrago; Oriol Pareras; Federico Costa; Javier Hernando

TL;DR

This work applies the Cross-Lingual Transfer Matrix (CLTM) to two paralinguistic tasks, gender identification and speaker verification, using a multilingual HuBERT-based encoder, to analyze how donor-language data affects target-language performance during fine-tuning.

Abstract

Paralinguistic speech tasks are often considered relatively language-agnostic, as they rely on extralinguistic acoustic cues rather than lexical content. However, prior studies report performance degradation under cross-lingual conditions, indicating non-negligible language dependence. Still, these studies typically focus on isolated language pairs or task-specific settings, limiting comparability and preventing a systematic assessment of task-level language dependence. We introduce the Cross-Lingual Transfer Matrix (CLTM), a systematic method to quantify cross-lingual interactions between pairs of languages within a given task. We apply the CLTM to two paralinguistic tasks, gender identification and speaker verification, using a multilingual HuBERT-based encoder, to analyze how donor-language data affects target-language performance during fine-tuning. Our results reveal distinct transfer patterns across tasks and languages, reflecting systematic, language-dependent effects.
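
To make the idea of a pairwise transfer score concrete, the following is a minimal sketch of one way such a matrix could be filled in. It is not the paper's definition: the normalization (donor gain divided by target gain over the same number of added samples), the function names `relative_transfer` and `transfer_matrix`, and all numbers in the usage note are assumptions made for illustration only.

```python
def relative_transfer(base, with_donor, with_target):
    """Hypothetical normalized transfer score for one (donor, target) pair.

    base:        target-language score after N target samples
    with_donor:  target-language score after N target + N donor samples
    with_target: target-language score after 2N target samples

    A value near 1 means donor data helped about as much as target data,
    near 0 means it did not help, and a negative value means it hurt.
    The ratio also works for lower-is-better metrics, since both
    differences then change sign together.
    """
    target_gain = with_target - base
    if target_gain == 0:
        raise ValueError("no target-language gain in the interval; score undefined")
    return (with_donor - base) / target_gain


def transfer_matrix(base, with_donor, with_target):
    """Build {donor: {target: score}} from per-language measurements.

    base and with_target map target language -> score;
    with_donor maps donor language -> {target language -> score}.
    """
    return {
        donor: {
            target: relative_transfer(base[target], scores[target], with_target[target])
            for target in scores
        }
        for donor, scores in with_donor.items()
    }
```

As a usage illustration with made-up values, `relative_transfer(0.80, 0.85, 0.90)` is approximately 0.5, which under this assumed normalization would read as "donor data delivered about half the gain that the same amount of target data would have".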

Paper Structure

This paper contains 15 sections, 5 equations, 5 figures, and 2 tables.

Introduction
Method
Cross-Lingual Transfer Matrix (CLTM)
Dynamic Training Interval
Experimental setup
Data
Model and Training
Downstream Tasks
Results
CLTM Qualitative Analysis
CLTM Structural Analysis
SV: Embedding Geometry
Stability and Statistical Reliability of CLTM
Conclusions
Acknowledgements

Figures (5)

Figure 1: Typical learning curve for a single language, showing the dynamic interval and derivative regimes.
Figure 2: Learning curves for both tasks, showing performance as a function of training samples for a representative subset of languages. The task-specific dynamic interval $[N,2N]$ used to compute the CLTM is highlighted.
Figure 3: Architecture for gender recognition.
Figure 4: Speaker verification pipeline: the model is trained for speaker identification (SID) via a classification head; embeddings are then L2-normalized and compared with cosine similarity for verification.
Figure 5: Reduced CLTMs (16 representative languages) for gender recognition and speaker verification. Colors show how much adding donor-language data affects performance on a target language compared to adding the same amount of target-language data.
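
The scoring step described in the Figure 4 caption (L2 normalization followed by cosine similarity) can be sketched as below. This is a generic illustration using only the Python standard library, not the authors' code; the function names and the decision threshold of 0.5 are hypothetical, and in practice a threshold would be tuned on held-out trials.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_score(emb_a, emb_b):
    """Cosine similarity of two embeddings, computed as the dot
    product of their L2-normalized versions."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    a = l2_normalize(emb_a)
    b = l2_normalize(emb_b)
    return sum(x * y for x, y in zip(a, b))


def same_speaker(emb_a, emb_b, threshold=0.5):
    """Accept the trial if the score reaches the threshold.
    The default threshold is a placeholder, not a value from the paper."""
    return cosine_score(emb_a, emb_b) >= threshold
```

Because both embeddings are normalized first, the score depends only on the angle between them and not on their magnitudes, so it always lies in the range from -1 to 1.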
