Zero-Shot Multi-Lingual Speaker Verification in Clinical Trials

Ali Akram, Marija Stanojevic, Malikeh Ehghaghi, Jekaterina Novikova

TL;DR

The paper addresses duplicate enrollment and cross-language identity verification in multinational clinical trials by applying zero-shot, text-independent speaker verification (SV) to speech from patients with cognitive or mental health disorders. It evaluates three pre-trained SV models (TitaNet, ECAPA-TDNN, SpeakerNet) on five languages (English, German, Danish, Spanish, Arabic) using longitudinal, clinically collected datasets, and reports equal error rates (EER) that are competitive in cross-language settings without language-specific fine-tuning. European languages yield lower EER (roughly 2.7%–2.9%), while Arabic remains more challenging (roughly 8.3%–8.4%). Performance also depends on the speech task (Picture Description is generally strongest) and on dataset characteristics. The work demonstrates that scalable, multilingual SV is feasible in clinical trials and offers guidance on task design and threshold calibration. It also discusses ethical considerations, potential biases, and future directions for broader language coverage and dialectal analysis.

Abstract

Due to the substantial number of clinicians, patients, and data collection environments involved in clinical trials, gathering high-quality data poses a significant challenge. In clinical trials, patients are assessed based on their speech data to detect and monitor cognitive and mental health disorders. We propose using these speech recordings to verify the identities of enrolled patients and to identify and exclude individuals who try to enroll multiple times in the same trial. Since clinical studies are often conducted across different countries, it is imperative to create a system that can perform speaker verification in diverse languages without additional development effort. We evaluate pre-trained TitaNet, ECAPA-TDNN, and SpeakerNet models by enrolling and testing with speech-impaired patients speaking English, German, Danish, Spanish, and Arabic. Our results demonstrate that the tested models can effectively generalize to clinical speakers, with less than 2.7% EER for European languages and 8.26% EER for Arabic. This represents a significant step in developing more versatile and efficient speaker verification systems for cognitive and mental health clinical trials that can be used across a wide range of languages and dialects, substantially reducing the effort required to develop speaker verification systems for multiple languages. We also evaluate how the speech task and the number of speakers involved in the trial influence performance, and show that the type of speech task affects model performance.
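
The EER figures quoted above can be made concrete with a small sketch. It is not the authors' evaluation code. It assumes that each verification trial is scored by cosine similarity between two speaker embeddings (a common choice, not confirmed by this abstract), and the helper names `cosine_similarity` and `equal_error_rate` and the toy scores are hypothetical. The EER is the error rate at the threshold where the false acceptance rate and the false rejection rate are closest to equal.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (assumed scoring rule)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def equal_error_rate(scores, labels):
    """Return (eer, threshold) for trial scores.

    labels: 1 for same-speaker trials, 0 for different-speaker trials.
    A trial is accepted when its score is >= threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer, best_thr = np.inf, 1.0, None
    # Sweep every observed score as a candidate threshold.
    for thr in np.unique(scores):
        accept = scores >= thr
        far = np.mean(accept[labels == 0])   # impostors wrongly accepted
        frr = np.mean(~accept[labels == 1])  # genuine speakers wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer, best_thr = gap, (far + frr) / 2, float(thr)
    return eer, best_thr


# Toy trial scores (illustrative only, not data from the paper).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
eer, threshold = equal_error_rate(toy_scores, toy_labels)
print(f"EER = {eer:.2%} at threshold {threshold:.2f}")
```

In a duplicate-enrollment check, the returned threshold would play the role of the calibrated decision point: a new recording whose similarity to an already enrolled speaker exceeds it would be flagged for review.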

Paper Structure (17 sections, 6 tables)