ParaLBench: A Large-Scale Benchmark for Computational Paralinguistics over Acoustic Foundation Models

Zixing Zhang, Weixiang Xu, Zhongren Dong, Kanglin Wang, Yimeng Wu, Jing Peng, Runming Wang, Dong-Yan Huang

TL;DR

ParaLBench is a large-scale benchmark that standardizes the evaluation of diverse paralinguistic tasks, including critical aspects of affective computing such as emotion recognition and emotion dimension prediction, across different acoustic foundation models.

Abstract

Computational paralinguistics (ComParal) aims to develop algorithms and models that automatically detect, analyze, and interpret non-verbal information in speech communication, e.g., emotion, health state, age, and gender. Despite rapid progress, the field depends heavily on sophisticated models designed for specific paralinguistic tasks. This heterogeneity and diversity of ComParal models largely prevents their deployment in realistic settings. Recently, with the advent of acoustic foundation models enabled by self-supervised learning, developing more generic models that can efficiently perceive a plethora of paralinguistic information has become an active topic in speech processing. However, the field lacks a unified evaluation framework for fair and consistent performance comparison. To bridge this gap, we present a large-scale benchmark, ParaLBench, which standardizes the evaluation process of diverse paralinguistic tasks, including critical aspects of affective computing such as emotion recognition and emotion dimension prediction, across different acoustic foundation models. The benchmark contains ten datasets with thirteen distinct paralinguistic tasks, covering short-, medium-, and long-term characteristics. Each task is carried out on 14 acoustic foundation models under a unified evaluation framework, which allows for an unbiased methodological comparison and offers a grounded reference for the ComParal community. Based on the insights gained from ParaLBench, we also point out potential research directions, in particular cross-corpus generalizability, to propel future ComParal research. The code associated with this study will be made available to foster the transparency and replicability of this work for subsequent researchers.

Paper Structure

This paper contains 59 sections, 10 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: In contrast to linguistics, which studies the text content, paralinguistics analyzes all other information in speech, such as emotion, age, gender, and health status.
  • Figure 2: An evaluation diagram for ParaLBench. Deep representations are first extracted from audio signals by an acoustic foundation model. These representations are then projected to a fixed dimension and fed into a standard Transformer. Finally, a task-specific classifier (i.e., two dense layers) is appended for classification.
  • Figure 3: Cross-corpus results of the acoustic foundation model on three emotion datasets: IEMOCAP, MELD, and MSP-Podcast. A subtitle such as 'IEMOCAP to MELD' indicates that the model is trained on IEMOCAP and tested for cross-corpus performance on MELD. The orange line represents the results achieved by the models on the original dataset.
  • Figure 4: Performance comparison of acoustic foundation models with and without LoRA efficient fine-tuning.
  • Figure 5: Performance of WavLM layer-wise features across short-, medium-, and long-term datasets. The upper and lower panels show the Base and Large versions of the acoustic foundation model. To limit computational cost, we evaluate every third layer.
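
The evaluation pipeline described in the Figure 2 caption (projection to a fixed dimension, a standard Transformer, then a two-layer dense classifier) can be sketched roughly as follows in PyTorch. This is a minimal illustration, not the authors' implementation: the class name `ParalinguisticHead`, the layer sizes, the number of attention heads, the ReLU activation, and the mean pooling over time are all assumptions, and a random tensor stands in for the output of an acoustic foundation model (768 is the hidden size of a Base-sized model such as WavLM Base).

```python
import torch
from torch import nn


class ParalinguisticHead(nn.Module):
    """Hypothetical downstream head: projection -> Transformer -> two dense layers."""

    def __init__(self, feat_dim: int, num_classes: int, proj_dim: int = 256,
                 num_heads: int = 4, num_layers: int = 1, hidden_dim: int = 128):
        super().__init__()
        # Project foundation-model features to a fixed dimension.
        self.proj = nn.Linear(feat_dim, proj_dim)
        # Standard Transformer encoder over the projected frame sequence.
        layer = nn.TransformerEncoderLayer(
            d_model=proj_dim, nhead=num_heads,
            dim_feedforward=4 * proj_dim, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Task-specific classifier made of two dense layers.
        self.classifier = nn.Sequential(
            nn.Linear(proj_dim, hidden_dim),
            nn.ReLU(),  # assumed activation
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, feat_dim)
        x = self.proj(feats)
        x = self.encoder(x)
        x = x.mean(dim=1)  # assumed pooling: average over time frames
        return self.classifier(x)


torch.manual_seed(0)
# Stand-in for foundation-model output: 2 utterances, 50 frames, 768 dims.
feats = torch.randn(2, 50, 768)
model = ParalinguisticHead(feat_dim=768, num_classes=4)
logits = model(feats)
print(logits.shape)  # torch.Size([2, 4])
```

For a regression task such as emotion dimension prediction, the same head could be used with `num_classes` set to the number of predicted dimensions and a regression loss in place of cross-entropy; that choice is likewise an assumption here.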