LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT

Le Zhuo; Ruibin Yuan; Jiahao Pan; Yinghao Ma; Yizhi LI; Ge Zhang; Si Liu; Roger Dannenberg; Jie Fu; Chenghua Lin; Emmanouil Benetos; Wei Xue; Yike Guo

TL;DR

LyricWhiz is a training-free, multilingual, zero-shot automatic lyrics transcription (ALT) framework that pairs Whisper as the transcription engine with GPT-4 as a contextual post-processor. The system achieves state-of-the-art Word Error Rate (WER) on English lyrics and transcribes lyrics effectively in multiple languages. The work also introduces MulJam, the first publicly available, large-scale multilingual lyrics dataset, which includes a human-annotated subset for noise level estimation and evaluation. The approach uses prompt engineering and a chain-of-thought ensemble strategy to reconcile multiple Whisper outputs, enabling robust long-form transcription across diverse genres. By releasing MulJam under CC BY-NC-SA, the work lowers barriers for multilingual ALT research and may accelerate cross-language lyric understanding and related music information retrieval (MIR) applications.

Abstract

We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic lyrics transcription method achieving state-of-the-art performance on various lyrics transcription datasets, even in challenging genres such as rock and metal. Our novel, training-free approach utilizes Whisper, a weakly supervised robust speech recognition model, and GPT-4, today's most performant chat-based large language model. In the proposed method, Whisper functions as the "ear" by transcribing the audio, while GPT-4 serves as the "brain," acting as an annotator with a strong performance for contextualized output selection and correction. Our experiments show that LyricWhiz significantly reduces Word Error Rate compared to existing methods in English and can effectively transcribe lyrics across multiple languages. Furthermore, we use LyricWhiz to create the first publicly available, large-scale, multilingual lyrics transcription dataset with a CC-BY-NC-SA copyright license, based on MTG-Jamendo, and offer a human-annotated subset for noise level estimation and evaluation. We anticipate that our proposed method and dataset will advance the development of multilingual lyrics transcription, a challenging and emerging task.
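Word Error Rate, the metric cited above, is conventionally the number of word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the paper's evaluation script. The function name `word_error_rate` and the lowercase-and-split normalization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length."""
    # Assumption: lowercase + whitespace split is a stand-in for whatever
    # text normalization a real evaluation would apply.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (before the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing in hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of five reference words.
    print(word_error_rate("hello darkness my old friend",
                          "hello darkness my own friend"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.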

Paper Structure (16 sections, 3 figures, 5 tables)

This paper contains 16 sections, 3 figures, 5 tables.

Introduction
Related Work
Automatic Lyrics Transcription
Weakly Supervised Automatic Speech Recognition
Chat-based Large Language Models
Methodology
Whisper as Zero-shot Lyrics Transcriptor
ChatGPT as Effective Lyrics Post-processor
Multilingual Lyrics Transcription Dataset
Experiments
Experimental Setup
Comparative Experiments
Ablation Studies
Dataset Analysis
Conclusion
...and 1 more sections

Figures (3)

Figure 1: Conceptual illustration of how LyricWhiz works: the user prompts two advanced models, Whisper and ChatGPT, to perform automatic lyrics transcription.
Figure 2: Framework of the proposed LyricWhiz. In the first stage, we employ PANNs [kong2020panns] to detect audio events and filter out non-vocal recordings. In the second stage, we use the language identification module in Whisper to predict the language of the input audio. We then construct language-specific prompts for Whisper and transcribe the input audio multiple times. In the final stage, we prompt ChatGPT with chain-of-thought (CoT) instructions to ensemble the multiple predictions and generate the final lyrics.
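The three stages described in the Figure 2 caption can be sketched as a small orchestration function. This is an illustrative outline under stated assumptions, not the authors' implementation: the model calls are injected as plain callables (`has_vocals`, `detect_language`, `transcribe`, `ask_llm`) so that no specific library API is implied, and the prompt wording and the default of three transcription runs are hypothetical.

```python
from typing import Callable, List, Optional


def build_whisper_prompt(language: str) -> str:
    # Hypothetical language-specific prompt; not the wording used in the paper.
    return f"Lyrics ({language}):"


def build_ensemble_prompt(candidates: List[str]) -> str:
    # Hypothetical chain-of-thought style instructions for the language model.
    numbered = "\n".join(
        f"Prediction {i}:\n{text}" for i, text in enumerate(candidates, start=1)
    )
    return (
        "Below are several transcriptions of the same song.\n"
        "First reason step by step about which parts are most consistent, "
        "then output the final lyrics.\n\n" + numbered
    )


def transcribe_lyrics(
    audio_path: str,
    has_vocals: Callable[[str], bool],      # stage 1: audio event filter
    detect_language: Callable[[str], str],  # stage 2: language identification
    transcribe: Callable[[str, str], str],  # stage 2: (audio, prompt) -> text
    ask_llm: Callable[[str], str],          # stage 3: ensemble via language model
    num_runs: int = 3,                      # assumption: number of repeated runs
) -> Optional[str]:
    # Stage 1: skip recordings without vocals.
    if not has_vocals(audio_path):
        return None
    # Stage 2: identify the language, then transcribe several times.
    language = detect_language(audio_path)
    prompt = build_whisper_prompt(language)
    candidates = [transcribe(audio_path, prompt) for _ in range(num_runs)]
    # Stage 3: ask the language model to reconcile the candidates.
    return ask_llm(build_ensemble_prompt(candidates))
```

Injecting the callables keeps the control flow testable with stubs; in practice each would wrap a real audio tagger, speech recognizer, or chat model.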
