ParaCLAP -- Towards a general language-audio model for computational paralinguistic tasks

Xin Jing, Andreas Triantafyllopoulos, Björn Schuller

TL;DR

ParaCLAP targets the data bottleneck in computational paralinguistics (CP) by adapting contrastive language-audio pretraining to CP tasks. It jointly trains a wav2vec 2.0 large audio encoder and a BERT-based text encoder to map their inputs into a shared $d=768$-dimensional space with a symmetric contrastive loss, enabling zero-shot task generalization through language queries. A templated query-generation strategy combines label-based descriptions with expert features from eGeMAPS to diversify the training queries, which improves cross-task performance and generalizes even to non-English CP datasets. The approach yields substantial gains over state-of-the-art open-source baselines across multiple CP benchmarks and demonstrates the value of diverse, expert-informed prompts for CP transfer learning, with clear avenues for future improvement via large language models and larger data scale.
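To make the symmetric contrastive objective concrete, the following is a minimal NumPy sketch of a CLAP-style loss over a batch of paired audio and text embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names, the fixed temperature of 0.07, and the equal weighting of the two directions are all assumptions, and the encoders themselves are replaced by random vectors.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """CLAP-style symmetric loss for a batch of paired embeddings.

    audio_emb, text_emb: arrays of shape (batch, d); row i of each is a pair.
    temperature: assumed fixed value; real models often learn it.
    """
    # L2-normalise so the dot product is a cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature  # shape (batch, batch)
    idx = np.arange(len(a))
    # Audio-to-text: each row should pick out its own column.
    loss_a2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-audio: each column should pick out its own row.
    loss_t2a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a2t + loss_t2a)


# Stand-in embeddings with the d = 768 dimensionality mentioned above.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 768))
loss_matched = symmetric_contrastive_loss(emb, emb)
loss_mismatched = symmetric_contrastive_loss(emb, emb[::-1])
```

Correctly paired embeddings give a much lower loss than the same embeddings paired in reversed order, which is the signal the two encoders are trained on.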

Abstract

Contrastive language-audio pretraining (CLAP) has recently emerged as a method for making audio analysis more generalisable. Specifically, CLAP-style models are able to 'answer' a diverse set of language queries, extending the capabilities of audio models beyond a closed set of labels. However, CLAP relies on a large set of (audio, query) pairs for pretraining. While such sets are available for general audio tasks, like captioning or sound event detection, there are no datasets with matched audio and text queries for computational paralinguistic (CP) tasks. As a result, the community relies on generic CLAP models trained for general audio with limited success. In the present study, we explore training considerations for ParaCLAP, a CLAP-style model suited to CP, including a novel process for creating audio-language queries. We demonstrate its effectiveness on a set of computational paralinguistic tasks, where it is shown to surpass the performance of open-source state-of-the-art models.

Paper Structure (12 sections, 2 equations, 2 figures, 1 table)

Figures (2)

  • Figure 1: Diagram of our proposed ParaCLAP model. ParaCLAP jointly trains two parallel encoders to establish a link between the input audio-text pairs in a shared multimodal space. In the training phase, a maximum of $n$ text prompts are randomly selected from the pool of generated text queries, combining both label and pseudo-caption information, while during inference, the query comprises only the actual labels.
  • Figure 2: Confusion matrices on the IEMOCAP dataset. The two models share identical training settings and differ only in whether emotion queries are included in the audio-text pairs (left: without, right: with).
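
The training-time query sampling and label-only inference described in the Figure 1 caption can be sketched as follows. This is a hedged illustration only: the function names, the example query strings, and the rule of drawing between 1 and $n$ distinct queries are assumptions, since the caption states only that at most $n$ prompts are randomly selected. The zero-shot step assumes the label whose text embedding has the highest cosine similarity to the audio embedding is chosen.

```python
import random

import numpy as np


def sample_training_queries(query_pool, n, rng):
    """Draw between 1 and n distinct queries from the pool (assumed rule)."""
    k = rng.randint(1, min(n, len(query_pool)))
    return rng.sample(query_pool, k)


def zero_shot_predict(audio_emb, label_embs, labels):
    """Return the label whose embedding is most similar to the audio embedding."""
    a = audio_emb / np.linalg.norm(audio_emb)
    t = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    return labels[int(np.argmax(t @ a))]


# Hypothetical pool mixing a label query with pseudo-caption style queries.
pool = [
    "speaker is angry",
    "pitch is high",
    "loudness is high",
    "speaking rate is fast",
]
queries = sample_training_queries(pool, n=3, rng=random.Random(0))

# Toy embeddings standing in for the outputs of the text and audio encoders.
labels = ["angry", "happy", "sad"]
label_embs = np.eye(3)
prediction = zero_shot_predict(np.array([0.1, 0.9, 0.2]), label_embs, labels)
```

At inference only the label queries are embedded, so adding a new task amounts to supplying a new list of label strings.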