Table of Contents
Fetching ...

More than words: Advancements and challenges in speech recognition for singing

Anna Kruspe

TL;DR

This paper surveys the challenges of applying ASR to singing, where pitch variation, vocal styles, and polyphony complicate transcription. It covers phoneme recognition, sung language identification, keyword spotting, lyrics alignment, lyrics-based retrieval, and full transcription, tracing methodological advances from MFCC/HMM baselines to end-to-end and transformer-based models. The work highlights data scarcity and the role of large, diverse datasets and multilingual resources, illustrating progress from early, adapted speech models to modern singing-specific approaches. It argues that advances in deep learning and foundation models will drive practical systems for accessibility, search, and cross-cultural music understanding.

Abstract

This paper addresses the challenges and advancements in speech recognition for singing, a domain distinctly different from standard speech recognition. Singing encompasses unique challenges, including extensive pitch variations, diverse vocal styles, and background music interference. We explore key areas such as phoneme recognition, language identification in songs, keyword spotting, and full lyrics transcription. I will describe some of my own experiences when performing research on these tasks just as they were starting to gain traction, but will also show how recent developments in deep learning and large-scale datasets have propelled progress in this field. My goal is to illuminate the complexities of applying speech recognition to singing, evaluate current capabilities, and outline future research directions.

More than words: Advancements and challenges in speech recognition for singing

TL;DR

This paper surveys the challenges of applying ASR to singing, where pitch variation, vocal styles, and polyphony complicate transcription. It covers phoneme recognition, sung language identification, keyword spotting, lyrics alignment, lyrics-based retrieval, and full transcription, tracing methodological advances from MFCC/HMM baselines to end-to-end and transformer-based models. The work highlights data scarcity and the role of large, diverse datasets and multilingual resources, illustrating progress from early, adapted speech models to modern singing-specific approaches. It argues that advances in deep learning and foundation models will drive practical systems for accessibility, search, and cross-cultural music understanding.

Abstract

This paper addresses the challenges and advancements in speech recognition for singing, a domain distinctly different from standard speech recognition. Singing encompasses unique challenges, including extensive pitch variations, diverse vocal styles, and background music interference. We explore key areas such as phoneme recognition, language identification in songs, keyword spotting, and full lyrics transcription. I will describe some of my own experiences when performing research on these tasks just as they were starting to gain traction, but will also show how recent developments in deep learning and large-scale datasets have propelled progress in this field. My goal is to illuminate the complexities of applying speech recognition to singing, evaluate current capabilities, and outline future research directions.
Paper Structure (9 sections)