voc2vec: A Foundation Model for Non-Verbal Vocalization

Alkis Koudounas, Moreno La Quatra, Marco Sabato Siniscalchi, Elena Baralis

TL;DR

This work introduces voc2vec, a foundation model tailored for non-verbal vocalizations, addressing the limitations of speech- and general-audio pretrained models in capturing vocal bursts. Built on a wav2vec 2.0–style SSL framework with a CNN encoder and Transformer, voc2vec is pre-trained on ~125 hours from 10 open non-verbal datasets and examined under three initialization strategies (scratch, LibriSpeech-, and AudioSet-pretrained). Across six downstream tasks, voc2vec-ls achieves state-of-the-art results, with approximately 5 percentage-point gains in UAR, 2 points in Accuracy, and 4 points in F1 Macro, and substantial improvements over OpenSmile and emotion2vec. The model is released as open-source to accelerate research and practical deployments in domains where interpreting non-verbal human sounds is crucial, such as infant monitoring and mental health. Overall, voc2vec advances universal representation learning for vocalization, bridging a gap between speech-focused models and broad audio foundation models.
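The three reported metrics respond differently to class imbalance, which is common in vocalization datasets. The sketch below is a minimal illustration using the standard definitions: UAR as the mean of per-class recall, and F1 Macro as the mean of per-class F1. It is not the authors' evaluation code, and the labels and predictions are hypothetical toy data.

```python
from collections import Counter


def _counts(y_true, y_pred):
    """Per-class true positives, false positives, and false negatives."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fn[t] += 1  # the true class was missed
            fp[p] += 1  # the predicted class was wrongly assigned
    return tp, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of samples classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def uar(y_true, y_pred):
    """Unweighted average recall: mean recall over classes present in y_true."""
    tp, _, fn = _counts(y_true, y_pred)
    classes = sorted(set(y_true))
    return sum(tp[c] / (tp[c] + fn[c]) for c in classes) / len(classes)


def f1_macro(y_true, y_pred):
    """Mean per-class F1 over all classes seen; undefined F1 counts as 0."""
    tp, fp, fn = _counts(y_true, y_pred)
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced toy set: a classifier that always predicts "laugh".
y_true = ["laugh"] * 8 + ["cry"] * 2
y_pred = ["laugh"] * 10

print(accuracy(y_true, y_pred))  # 0.8
print(uar(y_true, y_pred))       # 0.5
print(f1_macro(y_true, y_pred))  # about 0.444
```

In this toy case the majority-class predictor reaches 0.8 Accuracy but only 0.5 UAR, which is why UAR and F1 Macro are informative alongside Accuracy when classes are unevenly represented.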

Abstract

Speech foundation models have demonstrated exceptional capabilities in speech-related tasks. Nevertheless, these models often struggle with non-verbal audio data, such as vocalizations and baby crying, which are critical for various real-world applications. Audio foundation models handle non-speech data well but likewise fail to capture the nuanced features of non-verbal human sounds. In this work, we aim to overcome this shortcoming and propose a novel foundation model, termed voc2vec, specifically designed for non-verbal human data and pre-trained exclusively on open-source non-verbal audio datasets. We employ a collection of 10 datasets covering around 125 hours of non-verbal audio. Experimental results show that voc2vec is effective in non-verbal vocalization classification and that it outperforms conventional speech and audio foundation models. Moreover, voc2vec consistently outperforms strong baselines, namely OpenSmile and emotion2vec, on six different benchmark datasets. To the best of the authors' knowledge, voc2vec is the first universal representation model for vocalization tasks.

Paper Structure

This paper contains 11 sections, 1 equation, 1 figure, and 4 tables.

Figures (1)

  • Figure 1: t-SNE visualization. VIVAE dataset, first fold. wav2vec-ls (left), hubert-ls (center), voc2vec-ls (right).