Speech foundation models on intelligibility prediction for hearing-impaired listeners

Santiago Cuervo; Ricard Marxer

TL;DR

A simple method that trains a lightweight, specialized prediction head on top of frozen speech foundation models (SFMs) for speech intelligibility prediction produced the winning submission to the Clarity Prediction Challenge 2 (CPC2), demonstrating the promise of SFMs for speech perception applications.

Abstract

Speech foundation models (SFMs) have been benchmarked on many speech processing tasks, often achieving state-of-the-art performance with minimal adaptation. However, the SFM paradigm has been significantly less explored for applications of interest to the speech perception community. In this paper, we present a systematic evaluation of 10 SFMs on one such application: speech intelligibility prediction. We focus on the non-intrusive setup of the Clarity Prediction Challenge 2 (CPC2), where the task is to predict the percentage of words correctly perceived by hearing-impaired listeners from speech-in-noise recordings. We propose a simple method that learns a lightweight specialized prediction head on top of frozen SFMs. Our results reveal statistically significant differences in performance across SFMs. Our method produced the winning submission to the CPC2, demonstrating its promise for speech perception applications.
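To make the "lightweight head on a frozen backbone" idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the backbone exposes per-layer, per-frame features, and it replaces the paper's prediction head with a much simpler stand-in: a softmax-weighted mix of layers, a mean over time, and a linear map squashed to a 0-100 percentage. The names `frozen_backbone` and `PredictionHead` are hypothetical, and the backbone here is a fixed toy projection so that the example runs with the standard library only.

```python
import math
import random


def frozen_backbone(waveform, num_layers=3, dim=4, frame_len=4):
    """Hypothetical stand-in for a frozen SFM.

    Returns features indexed as [layer][frame][dim]. The weights come from a
    fixed seed and are never updated, which is what "frozen" means here.
    """
    rng = random.Random(0)
    weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_layers)]
    frames = [
        waveform[i:i + frame_len]
        for i in range(0, len(waveform) - frame_len + 1, frame_len)
    ]
    return [
        [[math.tanh(w * sum(frame)) for w in layer_w] for frame in frames]
        for layer_w in weights
    ]


class PredictionHead:
    """Simplified head (an assumption, not the paper's architecture).

    Only these parameters would be trained; the backbone stays untouched.
    """

    def __init__(self, num_layers, dim):
        self.layer_logits = [0.0] * num_layers  # mixing weights over layers
        self.w = [0.0] * dim                    # linear readout
        self.b = 0.0

    def __call__(self, feats):
        # Softmax over layers.
        top = max(self.layer_logits)
        exps = [math.exp(v - top) for v in self.layer_logits]
        alphas = [e / sum(exps) for e in exps]

        # Weighted sum over layers, mean over frames.
        num_frames = len(feats[0])
        pooled = [0.0] * len(self.w)
        for alpha, layer in zip(alphas, feats):
            for frame in layer:
                for d, value in enumerate(frame):
                    pooled[d] += alpha * value / num_frames

        # Linear readout, then sigmoid scaled to a percentage of words.
        score = sum(wi * xi for wi, xi in zip(self.w, pooled)) + self.b
        return 100.0 / (1.0 + math.exp(-score))


feats = frozen_backbone([0.1 * i for i in range(16)])
head = PredictionHead(num_layers=3, dim=4)
prediction = head(feats)  # an untrained head outputs the midpoint, 50.0
```

Because the backbone is frozen, its features could in principle be computed once and cached, so that training touches only the small number of head parameters.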

Paper Structure

This paper contains 17 sections, 1 figure, 4 tables.

Introduction
Intelligibility prediction model
Backbone
Prediction head
Experiments and Results
Experimental setup
Data, metrics, and baselines
Training
Software and computational cost
Results
Backbone performance
Binaural cross-attention ablation
Our submission to the CPC2
Ensemble performance
Related work
...and 2 more sections

Figures (1)

Figure 1: Intelligibility prediction model architecture. With the exception of frozen backbone blocks (grey), blocks with the same color indicate shared parameters. Left: Pipeline applied to each channel of the binaural signal. Right: Binaural block used in the temporal and layer transformers. The cross-attention layer enables modeling of non-linear binaural interactions.
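The binaural cross-attention described in the caption can be illustrated with a small sketch. This is not the authors' block: it assumes single-head scaled dot-product attention with no learned projections, and a residual connection that the caption does not confirm. The names `cross_attention` and `binaural_block` are hypothetical. The point it shows is that the same function is applied to both channels, with each ear's frames attending to the other ear's frames.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product cross-attention without learned projections.

    queries: [n_q][d] frames from one channel.
    context: [n_k][d] frames from the other channel (used as keys and values).
    """
    d = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in context
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * k[j] for w, k in zip(weights, context)) for j in range(d)]
        )
    return outputs


def binaural_block(left, right):
    """Apply the same cross-attention to both channels (shared computation).

    The residual addition is an assumption made for this sketch.
    """
    left_att = cross_attention(left, right)
    right_att = cross_attention(right, left)
    left_out = [[x + a for x, a in zip(row, att)] for row, att in zip(left, left_att)]
    right_out = [[x + a for x, a in zip(row, att)] for row, att in zip(right, right_att)]
    return left_out, right_out


# Two frames per ear, two feature dimensions per frame.
left = [[1.0, 0.0], [0.0, 1.0]]
right = [[0.5, 0.5], [1.0, -1.0]]
left_out, right_out = binaural_block(left, right)
```

Because the attention weights depend on products of left and right features, each channel's output is a non-linear function of both channels, which a simple sum or concatenation of the two channels would not provide.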
