Singer separation for karaoke content generation

Hsuan-Yu Lin, Xuanjun Chen, Jyh-Shing Roger Jang

TL;DR

This work tackles lead-singer extraction for karaoke with a two-stage singer separation framework: Stage 1 separates the vocals from the accompaniment using an enhanced Wave-U-Net+ model, and Stage 2 separates the lead vocals with DPRNN or DPTNet to obtain two vocal stems. It also presents an automatic model-selection mechanism that chooses among language- and singer-configured models based on pitch trends, and publicly releases MIR-SingerSeparation, a bilingual dataset designed for duet and harmonic vocal separation. Experimental results show that the proposed system (SSSYS) outperforms a 3-channel baseline in SI-SNRi and SDRi across English and Chinese duet and self-harmonic configurations, and that automatic model selection reaches 71.43% accuracy on real karaoke data. The system supports realistic karaoke content generation through single- or dual-lead vocal separation, particularly for sentimental ballads, and the released dataset provides a resource for further research in this area.
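For readers unfamiliar with the SI-SNRi metric mentioned above, the following is a minimal sketch of how scale-invariant SNR and its improvement over the unprocessed mixture are commonly computed. It follows the standard definition rather than the authors' evaluation code, and the function names (`si_snr`, `si_snr_improvement`) and the toy sine-wave signals are assumptions made for illustration.

```python
import numpy as np


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D signals (standard definition)."""
    estimate = np.asarray(estimate, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Remove the mean so a constant offset does not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target: the part of the estimate
    # that is a scaled copy of the reference signal.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = scale * target
    # Everything left over is treated as noise or interference.
    e_noise = estimate - s_target
    ratio = (np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps)
    return 10.0 * np.log10(ratio)


def si_snr_improvement(estimate, target, mixture):
    """SI-SNRi: gain of the separated estimate over the raw mixture, in dB."""
    return si_snr(estimate, target) - si_snr(mixture, target)


# Toy example (hypothetical signals, not data from the paper):
# two "singers" modeled as sine waves at different pitches.
t = np.arange(16000) / 16000.0
singer_a = np.sin(2 * np.pi * 220.0 * t)
singer_b = np.sin(2 * np.pi * 330.0 * t)
mixture = singer_a + singer_b
# Pretend a separator removed most, but not all, of singer B.
estimate_a = singer_a + 0.1 * singer_b
improvement_db = si_snr_improvement(estimate_a, singer_a, mixture)
print(f"SI-SNRi: {improvement_db:.2f} dB")
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why this metric is preferred when a separator's output gain is arbitrary.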

Abstract

Thanks to the rapid development of deep learning, singing voice can now be successfully separated from mono music audio. However, such separation only extracts human voices from the other musical instruments, which is insufficient for karaoke content generation, where only the lead singers need to be separated. For karaoke, we need either to separate music containing a male-female duet into two vocals or to extract a single lead vocal from music containing vocal harmony. We therefore propose a singer separation system that generates karaoke content for one or two separated lead singers. In particular, we introduce three models for the singer separation task and design an automatic model selection scheme that determines how many lead singers a song contains. We also collected a sufficiently large dataset, MIR-SingerSeparation, which has been publicly released to advance research in this area. Our singer separation is best suited to sentimental ballads and can be applied directly to karaoke content generation. To the best of our knowledge, this is the first singer-separation work for real-world karaoke applications.
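The data flow of such a system can be sketched as follows: a first stage splits the mixture into vocals and accompaniment, a selector picks one of several second-stage models, and the chosen model splits the vocals into two stems. This is only a structural sketch under stated assumptions. Every name here (`two_stage_separation`, `toy_stage1`, `toy_stage2`, `toy_selector`, the `"duet"` key) is hypothetical, and the toy stand-ins merely scale the signal; they are not the Wave-U-Net+, DPRNN, or DPTNet models, nor the pitch-based selection rule.

```python
from typing import Callable, Dict, Tuple

import numpy as np

# A separator maps one signal to a pair of signals.
Separator = Callable[[np.ndarray], Tuple[np.ndarray, np.ndarray]]


def two_stage_separation(
    mixture: np.ndarray,
    stage1: Separator,
    stage2_models: Dict[str, Separator],
    select_model: Callable[[np.ndarray], str],
) -> dict:
    """Chain two separators, choosing the second one from the vocal track."""
    # Stage 1: split the full mix into vocals and accompaniment.
    vocals, accompaniment = stage1(mixture)
    # Model selection: decide which stage-2 model suits these vocals.
    model_key = select_model(vocals)
    # Stage 2: split the vocal track into two vocal stems.
    vocal_a, vocal_b = stage2_models[model_key](vocals)
    return {
        "accompaniment": accompaniment,
        "vocal_a": vocal_a,
        "vocal_b": vocal_b,
        "model": model_key,
    }


# Hypothetical stand-ins so the sketch runs end to end.
def toy_stage1(x: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    return 0.6 * x, 0.4 * x  # (vocals, accompaniment)


def toy_stage2(v: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    return 0.5 * v, 0.5 * v  # (vocal A, vocal B)


def toy_selector(v: np.ndarray) -> str:
    return "duet"  # a real selector would inspect the vocals


mix = np.linspace(-1.0, 1.0, 8)
result = two_stage_separation(mix, toy_stage1, {"duet": toy_stage2}, toy_selector)
```

Keeping the stages as interchangeable callables makes it straightforward to swap one second-stage model for another without touching the rest of the pipeline.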

Paper Structure

This paper contains 10 sections, 3 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The pairing methods of different datasets. Capital letters represent different singers' vocals (yellow, green, and red blocks). Lowercase a is the accompaniment (blue block). Numbers represent the index of segments in a song.
  • Figure 2: System flowchart of the singer separation system.
  • Figure 3: The pitch trends in a song. The blue and red lines at the bottom of (a) are the pitches of the two lead singers; they correspond to vocals A and B in (b), respectively.