Non-Intrusive Speech Intelligibility Prediction for Hearing Aids using Whisper and Metadata
Ryandhimas E. Zezario, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao
TL;DR
This paper tackles non-intrusive prediction of speech intelligibility for hearing aids by introducing two Whisper-based improvements to MBI-Net. MBI-Net+ uses Whisper embeddings for richer cross-domain features, while MBI-Net++ adds a multi-task framework that jointly predicts intelligibility and HASPI with the weighted objective $O = \alpha \cdot \mathcal{L}_{\mathrm{Int}} + \beta \cdot \mathcal{L}_{\mathrm{HASPI}}$, where $\alpha$ and $\beta$ weight the intelligibility and HASPI losses. Experiments on the CPC 2023 Clarity dataset show that Whisper-based features and auxiliary HASPI supervision improve performance, with MBI-Net++ achieving the best non-intrusive results and ranking highly in the challenge. These findings highlight the value of cross-domain representations and auxiliary metrics for robust hearing-aid intelligibility assessment in real-world scenarios.
Abstract
Automated speech intelligibility assessment is pivotal for hearing aid (HA) development. In this paper, we present three novel methods to improve intelligibility prediction accuracy and introduce MBI-Net+, an enhanced version of MBI-Net, the top-performing system in the 1st Clarity Prediction Challenge. MBI-Net+ leverages Whisper's embeddings to create cross-domain acoustic features and incorporates metadata from the speech signals through a classifier that distinguishes between enhancement methods. Furthermore, MBI-Net+ integrates the hearing-aid speech perception index (HASPI) as a supplementary metric into the objective function to further boost prediction performance. Experimental results demonstrate that MBI-Net+ surpasses several intrusive baseline systems and MBI-Net on the Clarity Prediction Challenge 2023 dataset, validating the effectiveness of incorporating Whisper embeddings, speech metadata, and related complementary metrics to improve prediction performance for HAs.
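
As a rough illustration of how a supplementary HASPI term can enter the objective, the sketch below computes a weighted sum of the form $O = \alpha \cdot \mathcal{L}_{Int} + \beta \cdot \mathcal{L}_{HASPI}$ given in the TL;DR. It is not the authors' implementation. Using mean squared error for both terms, the function names `mse` and `multitask_objective`, and the default weights are all assumptions made for the example; the source does not specify them.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length, non-empty sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multitask_objective(
    int_pred: Sequence[float],
    int_true: Sequence[float],
    haspi_pred: Sequence[float],
    haspi_true: Sequence[float],
    alpha: float = 1.0,  # assumed weight on the intelligibility loss
    beta: float = 1.0,   # assumed weight on the HASPI loss
) -> float:
    """Weighted sum O = alpha * L_Int + beta * L_HASPI.

    Both losses are MSE here purely for illustration (an assumption).
    """
    loss_int = mse(int_pred, int_true)
    loss_haspi = mse(haspi_pred, haspi_true)
    return alpha * loss_int + beta * loss_haspi


# Example with made-up scores in [0, 1]; these values are not from the paper.
objective = multitask_objective(
    int_pred=[0.5, 1.0],
    int_true=[0.0, 1.0],
    haspi_pred=[0.2],
    haspi_true=[0.4],
    alpha=1.0,
    beta=0.5,
)
print(round(objective, 3))
```

Setting `beta` to zero recovers a single-task intelligibility objective, which makes the contribution of the auxiliary HASPI term easy to isolate in an ablation.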
