Non-Intrusive Speech Intelligibility Prediction for Hearing-Impaired Users using Intermediate ASR Features and Human Memory Models

Rhiannon Mogridge, George Close, Robert Sutherland, Thomas Hain, Jon Barker, Stefan Goetze, Anton Ragni

TL;DR

This work tackles non-intrusive speech intelligibility prediction for hearing-impaired users by using frozen Whisper decoder-layer representations as input features together with an exemplar-informed memory component. It introduces a two-branch ensemble (a primary branch and an exemplar-informed secondary branch) whose outputs are averaged to predict intelligibility scores, and it generalizes to listeners and enhancement systems within CPC2 that were unseen in training. On CPC2 data, the method achieves an RMSE of $25.3$, outperforming the intrusive HASPI baseline ($28.7$) and approaching the top CPC2 entries, with Whisper layers $7$ and $8$ identified as most informative. The results point to practical use in evaluating and optimizing hearing-aid outputs in non-intrusive settings, where intelligibility can be estimated efficiently and at scale.

Abstract

Neural networks have been successfully used for non-intrusive speech intelligibility prediction. Recently, the use of feature representations sourced from intermediate layers of pre-trained self-supervised and weakly-supervised models has been found to be particularly useful for this task. This work combines the use of Whisper ASR decoder layer representations as neural network input features with an exemplar-based, psychologically motivated model of human memory to predict human intelligibility ratings for hearing-aid users. Substantial performance improvement over an established intrusive HASPI baseline system is found, including on enhancement systems and listeners unseen in the training data, with a root mean squared error of 25.3 compared with the baseline of 28.7.
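
The two quantities the summary relies on, the averaging of the two branch outputs and the root mean squared error used to compare systems, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `ensemble_average` and `rmse` are hypothetical, the scores are assumed to be percentages of words correct on a 0 to 100 scale, and all numbers are invented for illustration rather than taken from CPC2.

```python
import math


def ensemble_average(primary, secondary):
    """Average per-utterance predictions from two branches (hypothetical helper)."""
    if len(primary) != len(secondary):
        raise ValueError("both branches must score the same utterances")
    return [(p + s) / 2.0 for p, s in zip(primary, secondary)]


def rmse(predicted, target):
    """Root mean squared error between predicted and true scores."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Invented example values (assumption: percent of words correct, 0 to 100).
primary_scores = [80.0, 40.0, 10.0, 95.0]    # primary branch output
secondary_scores = [70.0, 50.0, 30.0, 85.0]  # exemplar-informed branch output
true_correctness = [75.0, 50.0, 0.0, 100.0]  # listener ground truth

combined = ensemble_average(primary_scores, secondary_scores)
error = rmse(combined, true_correctness)
print(f"combined predictions: {combined}")
print(f"RMSE: {error:.2f}")
```

Under the same assumptions, a lower RMSE means the predictions sit closer to the listeners' true correctness, which is the sense in which 25.3 improves on the HASPI baseline of 28.7.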

Paper Structure

This paper contains 18 sections, 1 equation, 7 figures, and 1 table.

Figures (7)

  • Figure 1: Distribution of true correctness values in the training data.
  • Figure 2: Model architecture of proposed SI prediction.
  • Figure 3: Model performance by true correctness.
  • Figure 4: Model performance by mean hearing aid system correctness.
  • Figure 5: Performance of proposed prediction system depending on enhancement system.
  • ...and 2 more figures