LIWhiz: A Non-Intrusive Lyric Intelligibility Prediction System for the Cadenza Challenge

Ram C. M. C. Shekar, Iván López-Espejo

TL;DR

LIWhiz predicts lyric intelligibility for listeners, including those with hearing loss, using a non-intrusive approach. It pairs a frozen Whisper-based front-end, which extracts features from both the original audio and its hearing-loss-simulated version, with a trainable back-end that fuses these representations through linear mixing, a Bi-LSTM, and a final sigmoid predictor. On the Cadenza CLIP dataset, LIWhiz outperforms both the non-intrusive STOI and intrusive Whisper baselines, reaching an RMSE of about 27% and an NCC of about 0.65 on the evaluation set, with ablations showing the benefit of including the original audio. The results point to a robust, non-intrusive method for lyric intelligibility prediction that could support lyric enhancement and improved accessibility in music listening.

Abstract

We present LIWhiz, a non-intrusive lyric intelligibility prediction system submitted to the ICASSP 2026 Cadenza Challenge. LIWhiz leverages Whisper for robust feature extraction and a trainable back-end for score prediction. Tested on the Cadenza Lyric Intelligibility Prediction (CLIP) evaluation set, LIWhiz achieves a root mean square error (RMSE) of 27.07%, a 22.4% relative RMSE reduction over the STOI-based baseline, along with a substantial improvement in normalized cross-correlation (NCC).
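
To make the two reported metrics concrete, the following is a minimal sketch of how RMSE, NCC, and a relative RMSE reduction could be computed. It assumes NCC is the Pearson-style correlation between mean-removed predictions and listener scores, which the text above does not spell out. The function names are hypothetical, and the baseline RMSE recovered at the end is only what the two stated figures imply, not a value reported here.

```python
import math


def rmse(pred, target):
    """Root mean square error between predicted and true scores."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def ncc(pred, target):
    """Normalized cross-correlation (assumed here: Pearson-style, mean-removed)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    num = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    den = math.sqrt(
        sum((p - mean_p) ** 2 for p in pred) * sum((t - mean_t) ** 2 for t in target)
    )
    return num / den


def relative_reduction(baseline_rmse, system_rmse):
    """Fraction by which the system lowers the baseline RMSE."""
    return (baseline_rmse - system_rmse) / baseline_rmse


def implied_baseline(system_rmse, rel_reduction):
    """Baseline RMSE implied by a system RMSE and a relative reduction."""
    return system_rmse / (1.0 - rel_reduction)


# Derived from the stated 27.07% RMSE and 22.4% relative reduction.
baseline = implied_baseline(27.07, 0.224)
print(f"implied baseline RMSE: {baseline:.2f}%")
```

Under these assumptions, the stated figures imply a STOI-based baseline RMSE of roughly 34.9%.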

Paper Structure

This paper contains 4 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Diagram of the proposed LIP system, LIWhiz. The feature extractor (front-end), based on a frozen Whisper model, is shown on the left, where $\mathbf{s}\in\{\mathbf{x},\mathbf{y}\}$. The trainable back-end, which produces lyric intelligibility scores $I$ from features extracted from the original song excerpt $\mathbf{x}$ and its hearing-loss-simulated version $\mathbf{y}$, is shown on the right. LML denotes a linear mixing layer.
  • Figure 2: Normalized absolute values of the learned LML weights for both the encoder (left) and decoder (right).
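
As a rough illustration of the linear mixing layer (LML) named in the captions, the sketch below assumes the LML forms a learned weighted sum over per-layer Whisper features. It also assumes the quantity plotted in Figure 2 is each weight's absolute value divided by the sum of absolute values. Neither detail is confirmed by the captions, and the function names are hypothetical.

```python
def linear_mix(layers, weights):
    """Weighted sum of per-layer features.

    layers:  list of L feature matrices, each T x D (nested lists).
    weights: list of L scalar mixing weights (learned, in the assumed setup).
    """
    if len(layers) != len(weights):
        raise ValueError("need exactly one weight per layer")
    n_frames = len(layers[0])
    n_dims = len(layers[0][0])
    mixed = [[0.0] * n_dims for _ in range(n_frames)]
    for w, layer in zip(weights, layers):
        for t in range(n_frames):
            for d in range(n_dims):
                mixed[t][d] += w * layer[t][d]
    return mixed


def normalized_abs(weights):
    """Absolute weights scaled to sum to one (assumed normalization)."""
    total = sum(abs(w) for w in weights)
    return [abs(w) / total for w in weights]


# Two toy "layers", each with one frame and two feature dimensions.
toy_layers = [[[1.0, 2.0]], [[3.0, 4.0]]]
toy_weights = [0.5, -1.0]
print(linear_mix(toy_layers, toy_weights))
print(normalized_abs(toy_weights))
```

Read this way, a larger normalized value would indicate a layer that contributes more to the mixed representation, regardless of the weight's sign.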