
SALMon: A Suite for Acoustic Language Model Evaluation

Gallil Maimon, Amit Roth, Yossi Adi

TL;DR

SALMon addresses the lack of benchmarks for the non-semantic, acoustic abilities of speech language models (SLMs) by introducing a modelling-based suite with two core tasks: Acoustic Consistency and Acoustic-Semantic Alignment. It defines a simple scoring framework, in which a model should score a correct sample above an incorrect one, and uses several data pipelines to test sensitivity to speaker, background, and room variations as well as alignment with the spoken content. Through baselines and human evaluation, the study shows that current models struggle to match human performance on acoustic tasks, highlighting room to improve joint acoustic and semantic modelling. The benchmarks and accompanying resources are released openly to support the development of more acoustically aware SLMs.

Abstract

Speech language models have recently demonstrated great potential as universal speech processing systems. Such models can model the rich acoustic information present in audio signals beyond the spoken content, such as emotion and background noise. Despite this, benchmarks that evaluate awareness of a wide range of acoustic aspects are lacking. To help bridge this gap, we introduce SALMon, a novel evaluation suite encompassing background noise, emotion, speaker identity, and room impulse response. The proposed benchmarks evaluate both the consistency of the inspected element and how well it matches the spoken text. We follow a modelling-based approach, measuring whether a model gives correct samples higher scores than incorrect ones. This approach makes the benchmark fast to compute even for large models. We evaluated several speech language models on SALMon, highlighting the strengths and weaknesses of each evaluated method. We make the code and data publicly available at https://pages.cs.huji.ac.il/adiyoss-lab/salmon/ .
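
The modelling-based scoring described in the abstract can be illustrated with a minimal sketch. It is not the authors' released code: the function name `pairwise_accuracy`, the toy sample names, and the toy scores are all hypothetical, and counting ties as incorrect is an assumption. In practice the `score` callable would return something like a speech LM's log-likelihood for an audio sample.

```python
from typing import Callable, Sequence, Tuple, TypeVar

Sample = TypeVar("Sample")


def pairwise_accuracy(
    pairs: Sequence[Tuple[Sample, Sample]],
    score: Callable[[Sample], float],
) -> float:
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Assumption: a tie counts as incorrect, since the model failed to
    prefer the correct sample.
    """
    if not pairs:
        raise ValueError("need at least one (positive, negative) pair")
    correct = sum(1 for pos, neg in pairs if score(pos) > score(neg))
    return correct / len(pairs)


# Hypothetical scores standing in for model log-likelihoods.
toy_scores = {
    "real_a": -10.0,  # correct sample, pair A
    "fake_a": -12.5,  # incorrect sample, pair A (scored lower: correct)
    "real_b": -8.0,   # correct sample, pair B
    "fake_b": -7.5,   # incorrect sample, pair B (scored higher: wrong)
}
toy_pairs = [("real_a", "fake_a"), ("real_b", "fake_b")]

accuracy = pairwise_accuracy(toy_pairs, toy_scores.__getitem__)
print(accuracy)  # 0.5: the model prefers the real sample in one of two pairs
```

Because each pair needs only two scoring passes and no generation, a metric of this shape stays cheap to compute, which is consistent with the abstract's remark that the benchmark is fast even for large models.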

Paper Structure

This paper contains 9 sections, 2 equations, 1 figure, and 2 tables.

Figures (1)

  • Figure 1: A demonstration of SALMon, in which a speech LM is expected to assign higher likelihood to real samples. (a) Acoustic consistency, here sentiment consistency, where the negative sample changes emotion mid-sentence. (b) Semantic-acoustic alignment, here sentiment alignment.