Lyrics Transcription for Humans: A Readability-Aware Benchmark

Ondřej Cífka; Hendrik Schreiber; Luke Miner; Fabian-Robert Stöter

TL;DR

Jam-ALT introduces a readability-focused benchmark for automatic lyrics transcription by revising JamendoLyrics MultiLang to conform to industry transcription rules and by designing metrics that jointly evaluate word accuracy and lyric-specific formatting such as line breaks and parentheses marking background vocals. The authors demonstrate that a custom, formatting-aware system outperforms strong baselines and quantify the impact of dataset revisions on WER and formatting metrics. They also validate the approach on the Schubert Winterreise dataset to assess generalization. The work provides a concrete pathway to improve user-facing lyrics displays on platforms by balancing transcription fidelity with readability and musical structure.

Abstract

Writing down lyrics for human consumption involves not only accurately capturing word sequences, but also incorporating punctuation and formatting for clarity and to convey contextual information. This includes song structure, emotional emphasis, and contrast between lead and background vocals. While automatic lyrics transcription (ALT) systems have advanced beyond producing unstructured strings of words and are able to draw on wider context, ALT benchmarks have not kept pace and continue to focus exclusively on words. To address this gap, we introduce Jam-ALT, a comprehensive lyrics transcription benchmark. The benchmark features a complete revision of the JamendoLyrics dataset, in adherence to industry standards for lyrics transcription and formatting, along with evaluation metrics designed to capture and assess the lyric-specific nuances, laying the foundation for improving the readability of lyrics. We apply the benchmark to recent transcription systems and present additional error analysis, as well as an experimental comparison with a classical music dataset.

Paper Structure (13 sections, 4 equations, 8 figures, 5 tables)

This paper contains 13 sections, 4 equations, 8 figures, 5 tables.

Introduction
Dataset
Metrics
Word Error Rates
Punctuation and Line Breaks
Results
Benchmark Results
Effect of Revisions
Error Analysis
Schubert Winterreise Dataset
Discussion
Conclusion
Acknowledgment

Figures (8)

Figure 1: Error types captured by our metrics. Each token is classified as a word, punctuation mark, or parenthesis (enclosing background vocals). Special tokens are added in place of line and section breaks. Each token type is covered by a separate metric; differences in letter case are handled separately.
Figure 2: Song-level word error rates by language. Note that strong outliers occur; for clarity, they are not displayed here, but affect the means, which are indicated by triangles.
Figure 3: Word edit operation frequencies on our benchmark (one run per system). *near* are substitutions that differ in few characters, *sub* are the remaining substitutions. *case* are hits with case errors, *hit* are the remaining (case-sensitive) hits. The rest are insertions (*ins*) and deletions (*del*). The frequencies are normalized by the reference length, so that: $\text{hit}+\text{case}+\text{near}+\text{sub}+\text{del}=1$, $\text{WER}=\text{near}+\text{sub}+\text{ins}+\text{del}$, $\text{WER}'-\text{WER}=\text{case}$, and $\text{hit}+\text{case}+\text{near}+\text{sub}+\text{ins}$ corresponds to the length of the prediction.
Figure 4: Edit operation counts on non-word (punctuation and formatting) tokens by token type (P = punctuation, B = parenthesis, L = line break, S = section break). $\varnothing$ denotes the absence of a token, i.e. it stands for insertion (on the reference axis) or deletion (on the prediction axis). Substitution of/by a word token is counted as an insertion/deletion, respectively. Only a single run per system is considered.
Figure 5: Word edit operation frequencies on SWD. See the caption of Figure 3.
...and 3 more figures
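
The identities stated in the Figure 3 caption can be checked with a few lines of arithmetic. The following is a minimal sketch, not the benchmark's own implementation: the `WordOpCounts` class, its field names, and the example counts are hypothetical, and WER' is assumed to be the case-sensitive variant of WER, as the relation WER' − WER = *case* suggests.

```python
from dataclasses import dataclass


@dataclass
class WordOpCounts:
    """Raw word edit operation counts (hypothetical container for illustration)."""
    hit: int    # exact, case-sensitive matches
    case: int   # matches that differ only in letter case
    near: int   # substitutions that differ in few characters
    sub: int    # all remaining substitutions
    ins: int    # words present only in the prediction
    dele: int   # words present only in the reference ("del" is a Python keyword)

    @property
    def ref_len(self) -> int:
        # Every reference word is a hit, case error, substitution, or deletion.
        return self.hit + self.case + self.near + self.sub + self.dele

    @property
    def pred_len(self) -> int:
        # Every predicted word is a hit, case error, substitution, or insertion.
        return self.hit + self.case + self.near + self.sub + self.ins

    def wer(self) -> float:
        # Case-insensitive WER: case errors count as hits.
        return (self.near + self.sub + self.ins + self.dele) / self.ref_len

    def wer_case_sensitive(self) -> float:
        # Assumed meaning of WER': case errors count as errors.
        return (self.case + self.near + self.sub + self.ins + self.dele) / self.ref_len


# Made-up counts, chosen only so that the reference length is 100.
counts = WordOpCounts(hit=70, case=5, near=5, sub=10, ins=4, dele=10)
print(counts.ref_len, counts.pred_len)            # reference and prediction lengths
print(counts.wer(), counts.wer_case_sensitive())  # WER and WER'
```

Because all frequencies are normalized by the reference length, WER can exceed 1 when a system inserts many words, while the five reference-side frequencies always sum to 1.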
