SLAP: Learning Speaker and Health-Related Representations from Natural Language Supervision

Angelika Ando, Auguste Crabeil, Adrien Lesage, Rachid Riad

TL;DR

SLAP tackles zero-shot inference of speaker demographics, health status, and voice quality by aligning speech with natural-language descriptions through contrastive learning. It pairs a ViT-based audio encoder with a text encoder that embeds LLM-generated multilingual speaker descriptions, and is trained with a CLAP-style contrastive objective plus self-supervised reconstruction on more than 3,400 hours of speech from nine datasets. Across 38 tasks in 7 languages, the model reaches 62.9% average F1 in zero-shot evaluation (versus 42.4% for CLAP) and competitive linear-probing results, with strong out-of-distribution generalization to unseen languages and clinical populations. The approach points toward scalable, language- and population-aware health monitoring from speech that relies less on task-specific labeled data.

Abstract

Speech encodes paralinguistic information such as demographics, voice quality, and health. Yet no audio foundation model supports zero-shot or out-of-distribution (OOD) generalization to these tasks. We introduce SLAP (Speaker contrastive Language-Audio Pretraining), the first model aligning speech with natural language descriptions of speaker and health metadata through contrastive learning. SLAP combines a Vision Transformer audio encoder with text encoders, trained on more than 3400 hours across 9 datasets with diverse speaker annotations. We evaluated on 38 binary classification tasks spanning demographics, voice characteristics, and clinical assessments across 14 datasets in 7 languages. SLAP achieves 62.9% average F1 in zero-shot evaluation, a 48% relative improvement over CLAP (42.4%), while demonstrating strong OOD generalization to unseen languages and clinical populations. When fine-tuned with linear probing, SLAP reaches 69.3% F1 overall and achieves best-in-class performance on health tasks (57.9% F1), surpassing larger foundation models.
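The contrastive alignment described in the abstract can be illustrated with a minimal sketch of a CLAP-style symmetric loss. This is not the authors' implementation: the function names, the temperature value of 0.07, and the toy embeddings are assumptions made for illustration, and the self-supervised objective mentioned in the TL;DR is left out.

```python
import numpy as np

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clap_style_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row i of audio_emb is assumed to be paired with row i of text_emb.
    The temperature is a hypothetical default, not a value from the paper.
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature           # (batch, batch) cosine similarities
    idx = np.arange(logits.shape[0])         # matching pairs sit on the diagonal
    loss_audio_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_audio = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_audio_to_text + loss_text_to_audio)

# Toy check: perfectly aligned pairs versus deliberately mismatched pairs.
audio = np.eye(4)
aligned_loss = clap_style_loss(audio, np.eye(4))
mismatched_loss = clap_style_loss(audio, np.roll(np.eye(4), 1, axis=0))
```

In this toy setup the aligned batch yields a much lower loss than the mismatched one, which is the behavior the objective is meant to reward: each audio embedding is pulled toward the description of its own speaker and pushed away from the other descriptions in the batch.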

Paper Structure

This paper contains 10 sections, 3 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: SLAP pipeline. (a) Pretraining. For each audio, an LLM generates a speaker description, and the Speech Encoder is trained with contrastive and self-supervised objectives. (b) Zero-shot Evaluation. A pair of text prompts is given to the Text Encoder. The class with the higher cosine similarity to the audio embedding is chosen.
  • Figure 2: Performance on open-source medical tasks. F1 scores on psychiatry, neurology, and vocal dysfunctions.
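
The zero-shot decision rule in Figure 1(b) can be sketched in a few lines. The example below assumes the audio and prompt embeddings have already been computed; the function name and the toy vectors are hypothetical stand-ins, not outputs of the SLAP encoders.

```python
import numpy as np

def zero_shot_predict(audio_emb, prompt_embs):
    """Pick the prompt whose embedding has the highest cosine similarity
    to the audio embedding. Returns (class index, similarity per prompt)."""
    a = audio_emb / np.linalg.norm(audio_emb)
    p = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    sims = p @ a                              # one cosine similarity per prompt
    return int(np.argmax(sims)), sims

# Toy vectors standing in for encoder outputs (hypothetical values).
audio = np.array([0.9, 0.1, 0.0])
prompts = np.array([
    [1.0, 0.0, 0.0],   # stand-in for the embedding of the first text prompt
    [0.0, 1.0, 0.0],   # stand-in for the embedding of the second text prompt
])
label, sims = zero_shot_predict(audio, prompts)
```

Here the audio vector lies closer to the first prompt, so the first class is selected. For a binary task, the two prompts would describe the two possible classes, and no task-specific training is needed at inference time.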