SLAP: Learning Speaker and Health-Related Representations from Natural Language Supervision
Angelika Ando, Auguste Crabeil, Adrien Lesage, Rachid Riad
TL;DR
SLAP tackles zero-shot inference of speaker demographics, health status, and voice quality by aligning speech with natural-language descriptions through contrastive learning. It pairs a ViT-based audio encoder with a text encoder, using LLM-generated multilingual speaker descriptions as the text side, and is trained with a CLAP-style objective plus self-supervised reconstruction on more than 3,400 hours from nine datasets. The model achieves state-of-the-art zero-shot performance and competitive linear probing results across 38 tasks in 7 languages, with strong out-of-domain generalization to unseen languages and clinical populations. This approach enables scalable, language- and population-aware health monitoring from speech, reducing reliance on task-specific labeled data and improving clinical deployment potential.
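To make the "CLAP-style objective" concrete, the following is a minimal NumPy sketch of a symmetric contrastive (InfoNCE) loss over a batch of paired audio and text embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature value of 0.07, and the use of random vectors in place of encoder outputs are all hypothetical, and the self-supervised reconstruction term mentioned above is not included.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clap_style_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each matrix is assumed to be a matched pair.

    The temperature is an illustrative assumption, not a value from the paper.
    """
    a = l2_normalize(np.asarray(audio_emb, dtype=float))
    t = l2_normalize(np.asarray(text_emb, dtype=float))
    logits = a @ t.T / temperature          # shape: (batch, batch)
    idx = np.arange(logits.shape[0])        # matched pairs sit on the diagonal
    loss_audio_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_audio = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_audio_to_text + loss_text_to_audio)

# Toy check with random stand-ins for encoder outputs (hypothetical data).
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
matched_loss = clap_style_loss(emb, emb)                          # aligned pairs
mismatched_loss = clap_style_loss(emb, np.roll(emb, 1, axis=0))   # shifted pairs
```

In this toy setup the loss is lower when each audio row is paired with its own text row than when the pairing is shifted by one, which is the behavior the objective rewards during training.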
Abstract
Speech encodes paralinguistic information such as demographics, voice quality, and health. Yet no existing audio foundation model supports zero-shot or out-of-distribution (OOD) generalization to these tasks. We introduce SLAP (Speaker contrastive Language-Audio Pretraining), the first model aligning speech with natural language descriptions of speaker and health metadata through contrastive learning. SLAP combines a Vision Transformer audio encoder with text encoders and is trained on more than 3,400 hours across 9 datasets with diverse speaker annotations. We evaluate on 38 binary classification tasks spanning demographics, voice characteristics, and clinical assessments across 14 datasets in 7 languages. SLAP achieves 62.9% average F1 in zero-shot evaluation, a 48% relative improvement over CLAP (42.4%), while demonstrating strong OOD generalization to unseen languages and clinical populations. With linear probing, SLAP reaches 69.3% F1 overall and achieves best-in-class performance on health tasks (57.9% F1), surpassing larger foundation models.
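As a rough illustration of how zero-shot evaluation on a binary task can work with an aligned audio-text model, the sketch below compares an audio embedding against the embeddings of two contrasting text descriptions and picks the closer one. This is a generic CLAP-style decision rule assumed for illustration; the abstract does not specify SLAP's actual prompts or scoring, and the function names and toy vectors here are hypothetical stand-ins for real encoder outputs.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two 1-D vectors.
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def zero_shot_binary(audio_emb, pos_text_emb, neg_text_emb):
    """Predict the positive class if the audio is closer to the positive description.

    Assumed decision rule for illustration, not taken from the paper.
    """
    sim_pos = cosine(audio_emb, pos_text_emb)
    sim_neg = cosine(audio_emb, neg_text_emb)
    return sim_pos > sim_neg, sim_pos, sim_neg

# Hypothetical embeddings: in practice these would come from the audio encoder
# and from the text encoder applied to two contrasting speaker descriptions.
audio = [0.9, 0.2, 0.1]
pos_description = [1.0, 0.0, 0.0]
neg_description = [0.0, 1.0, 0.0]
label, sim_pos, sim_neg = zero_shot_binary(audio, pos_description, neg_description)
```

No task-specific training is involved in this rule: only the two text descriptions change from task to task, which is what allows the same model to be evaluated across many binary tasks.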
