Leveraging Whisper Embeddings for Audio-based Lyrics Matching

Eleonora Mancini, Joan Serrà, Paolo Torroni, Yuki Mitsufuji

TL;DR

The paper addresses reproducibility and data scarcity in lyrics-based music information retrieval by proposing WEALY, an end-to-end audio-based lyrics matching pipeline that relies on Whisper decoder embeddings to derive lyrics-aware representations directly from audio. It trains a transformer-based encoder using NT-Xent contrastive loss on musical version identification datasets (DVI, SHS, LYC) and demonstrates competitive performance against transcription-based baselines while maintaining full reproducibility. Ablation studies reveal the importance of the NT-Xent loss, GeM pooling, and multilingual Whisper cues, and a straightforward late fusion with an audio-content model further improves retrieval. The authors release code and model checkpoints to foster transparency and establish a robust benchmark for future MIR research, highlighting the practical impact on copyright detection, music discovery, and version identification.
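To make the contrastive objective named above concrete, here is a minimal sketch of an NT-Xent loss in plain Python. It is not the authors' implementation: the function names, the toy batch, the temperature value, and the simplification to exactly one positive per item are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def nt_xent_loss(embeddings, positive_index, temperature=0.1):
    """Mean NT-Xent loss over a batch (illustrative sketch).

    embeddings: list of equal-length float vectors, one per track (hypothetical input).
    positive_index: positive_index[i] is the batch position of the positive for item i.
    temperature: softmax temperature; 0.1 is an illustrative value, not from the paper.
    """
    z = [l2_normalize(v) for v in embeddings]
    total = 0.0
    for i, anchor in enumerate(z):
        # Temperature-scaled cosine similarity to every other item in the batch.
        sims = {
            k: sum(a * b for a, b in zip(anchor, other)) / temperature
            for k, other in enumerate(z)
            if k != i
        }
        # Log-sum-exp over all non-anchor items, computed stably.
        peak = max(sims.values())
        log_denominator = peak + math.log(sum(math.exp(s - peak) for s in sims.values()))
        # Cross-entropy with the positive treated as the correct "class".
        total += log_denominator - sims[positive_index[i]]
    return total / len(z)


# Toy batch: items 0/1 and 2/3 stand in for two versions each of two songs.
batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
matched = nt_xent_loss(batch, positive_index=[1, 0, 3, 2])
mismatched = nt_xent_loss(batch, positive_index=[2, 3, 0, 1])
```

On this toy batch, the loss is small when each item's declared positive is its nearest neighbour (`matched`) and large when positives are paired across songs (`mismatched`), which is the behaviour a version-identification objective relies on.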

Abstract

Audio-based lyrics matching can be an appealing alternative to other content-based retrieval approaches, but existing methods often suffer from limited reproducibility and inconsistent baselines. In this work, we introduce WEALY, a fully reproducible pipeline that leverages Whisper decoder embeddings for lyrics matching tasks. WEALY establishes robust and transparent baselines, while also exploring multimodal extensions that integrate textual and acoustic features. Through extensive experiments on standard datasets, we demonstrate that WEALY achieves a performance comparable to state-of-the-art methods that lack reproducibility. In addition, we provide ablation studies and analyses on language robustness, loss functions, and embedding strategies. This work contributes a reliable benchmark for future research, and underscores the potential of speech technologies for music information retrieval tasks.

Paper Structure

This paper contains 12 sections, 1 equation, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Overview of the WEALY architecture. Stage 1 extracts lyrical latents with Whisper, while stage 2 learns contextualized representations with a transformer encoder and projects them into a compact embedding space.
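As a rough illustration of how a variable-length sequence of latents can be collapsed into one fixed-size vector, the sketch below implements generalized-mean (GeM) pooling, which the TL;DR lists among the ablated components. Where exactly the pooling sits in WEALY is not stated on this page, so its placement, the function name, and the default exponent are assumptions.

```python
def gem_pool(frames, p=3.0, eps=1e-6):
    """Generalized-mean pooling over time (illustrative sketch).

    frames: list of T vectors, each with D floats (hypothetical sequence of latents).
    p: pooling exponent; p=1 gives average pooling, large p approaches max pooling.
    eps: lower clamp so that fractional powers stay well defined.
    Returns a single D-dimensional vector.
    """
    if not frames:
        raise ValueError("gem_pool needs at least one frame")
    dim = len(frames[0])
    pooled = []
    for d in range(dim):
        # Clamp, raise to the p-th power, average over time, then take the p-th root.
        mean_pow = sum(max(frame[d], eps) ** p for frame in frames) / len(frames)
        pooled.append(mean_pow ** (1.0 / p))
    return pooled


# Two frames with two features each.
frames = [[1.0, 2.0], [3.0, 4.0]]
avg_like = gem_pool(frames, p=1.0)   # behaves like mean pooling
max_like = gem_pool(frames, p=50.0)  # approaches max pooling
```

The exponent `p` interpolates between average and max pooling, so a single parameter controls how strongly the pooled vector is dominated by the most salient frames.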