LIWhiz: A Non-Intrusive Lyric Intelligibility Prediction System for the Cadenza Challenge
Ram C. M. C. Shekar, Iván López-Espejo
TL;DR
LIWhiz tackles lyric intelligibility prediction for listeners, including those with hearing loss, using a non-intrusive approach. It combines a Whisper-based front-end, which extracts rich features from both the original and the hearing-loss-simulated audio, with a trainable back-end that fuses these representations through linear mixing, a Bi-LSTM, and a final sigmoid predictor. On the Cadenza CLIP dataset, LIWhiz outperforms both non-intrusive STOI and intrusive Whisper baselines, achieving an RMSE of about 27% and an NCC of about 0.65 on the evaluation set, with ablations showing the benefit of including the original audio. The results demonstrate a robust, non-intrusive method for lyric intelligibility prediction that could enable lyric enhancement and improved accessibility in music listening.
Abstract
We present LIWhiz, a non-intrusive lyric intelligibility prediction system submitted to the ICASSP 2026 Cadenza Challenge. LIWhiz uses Whisper for robust feature extraction and a trainable back-end for score prediction. On the Cadenza Lyric Intelligibility Prediction (CLIP) evaluation set, LIWhiz achieves a root mean square error (RMSE) of 27.07%, a 22.4% relative RMSE reduction over the STOI-based baseline, and also substantially improves normalized cross-correlation (NCC).
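For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how RMSE, NCC, and a relative RMSE reduction could be computed. It assumes that intelligibility scores are percentages in [0, 100] and that NCC is the mean-removed (Pearson-style) normalized cross-correlation; the function names and the toy scores are illustrative and are not taken from the challenge tooling.

```python
import math


def rmse(pred, target):
    """Root mean square error between predicted and true scores."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def ncc(pred, target):
    """Mean-removed normalized cross-correlation (assumed definition)."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_p = sum(pred) / len(pred)
    mean_t = sum(target) / len(target)
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    denom = math.sqrt(var_p * var_t)
    if denom == 0:
        raise ValueError("NCC is undefined for constant inputs")
    return cov / denom


def relative_reduction(baseline_rmse, system_rmse):
    """Fraction by which system_rmse is lower than baseline_rmse."""
    return (baseline_rmse - system_rmse) / baseline_rmse


def implied_baseline(system_rmse, reduction):
    """Baseline RMSE implied by a system RMSE and a relative reduction."""
    return system_rmse / (1.0 - reduction)


# Toy scores (hypothetical, in percent), not challenge data.
toy_pred = [10.0, 50.0, 90.0]
toy_true = [0.0, 50.0, 100.0]
toy_rmse = rmse(toy_pred, toy_true)
toy_ncc = ncc(toy_pred, toy_true)

# Baseline RMSE implied by the two figures quoted in the abstract.
baseline = implied_baseline(27.07, 0.224)
```

Under these assumptions, the figures quoted above (27.07% RMSE and a 22.4% relative reduction) would imply a baseline RMSE of roughly 34.9%; this value is derived arithmetically here and is not quoted from the source.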
