
Drop the beat! Freestyler for Accompaniment Conditioned Rapping Voice Generation

Ziqian Ning, Shuai Wang, Yuepeng Jiang, Jixun Yao, Lei He, Shifeng Pan, Jie Ding, Lei Xie

TL;DR

Freestyler generates rapping vocals conditioned on both lyrics and accompaniment, addressing two gaps: rap is underexplored in vocal generation, and generated vocals need to align rhythmically with the beat. It uses a three-stage pipeline (lyrics-to-semantic, semantic-to-spectrogram, spectrogram-to-audio) built on discrete semantic tokens and a conditional flow matching model, with a 3-second reference audio prompt for zero-shot timbre control. The authors also present RapBank, a large rap dataset collected from the internet together with a processing pipeline, built to overcome the scarcity of public rap data and to enable training of accompaniment-conditioned rap generation. Experimental results show that Freestyler produces high-quality rap that is well synchronized with the accompanying beats, and that its zero-shot timbre control generalizes to unseen timbres. The dataset and processing pipeline are released publicly, enabling broader research in accompaniment-conditioned vocal generation for rap.

Abstract

Rap, a prominent genre of vocal performance, remains underexplored in vocal generation. General vocal synthesis depends on precise note and duration inputs, requiring users to have related musical knowledge, which limits flexibility. In contrast, rap typically features simpler melodies, with a core focus on a strong rhythmic sense that harmonizes with accompanying beats. In this paper, we propose Freestyler, the first system that generates rapping vocals directly from lyrics and accompaniment inputs. Freestyler utilizes language model-based token generation, followed by a conditional flow matching model to produce spectrograms and a neural vocoder to restore audio. It allows a 3-second prompt to enable zero-shot timbre control. Due to the scarcity of publicly available rap datasets, we also present RapBank, a rap song dataset collected from the internet, alongside a meticulously designed processing pipeline. Experimental results show that Freestyler produces high-quality rapping voice generation with enhanced naturalness and strong alignment with accompanying beats, both stylistically and rhythmically.

Paper Structure

This paper contains 39 sections, 3 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: The overall pipeline of Freestyler. With lyrics and accompaniment as condition, it can generate rapping voice that matches the style and rhythm of the accompaniment.
  • Figure 2: Overview of Freestyler. The lyrics-to-semantic model in (a) predicts semantic tokens based on lyrics and accompaniment. The accompaniment feature is shifted left by $K$ frames to provide additional rhythmic context. The semantic-to-spectrogram model in (b) generates mel-spectrograms from the semantic tokens, which are interpolated to align with the spectrogram's frame rate. Speaker embedding is provided to both models to control the timbre.
  • Figure 3: The extraction process of the accompaniment feature and semantic tokens. Each block in Wav2Vec XLS-R represents 6 attention layers, with accompaniment and vocals going through 6 and 18 layers respectively.
  • Figure 4: The spectrogram of (a) GT accompaniment, (b) GT vocal and (c) Freestyler-generated vocal. Vertical lines are human-annotated beat positions in the accompaniment. The energy of the GT accompaniment is also drawn in (a).
  • Figure 5: The distribution of language and duration of RapBank
  • ...and 1 more figure
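
The Figure 2 caption mentions two alignment steps: the accompaniment feature is shifted left by $K$ frames, and the semantic tokens are interpolated to the spectrogram's frame rate. The following is a minimal sketch of what such steps could look like, not the authors' implementation. The function names `shift_left` and `upsample_tokens` are hypothetical, and both the zero-padding of the shifted tail and the nearest-neighbour interpolation are assumptions, since the caption does not specify either.

```python
import numpy as np


def shift_left(features: np.ndarray, k: int) -> np.ndarray:
    """Shift a (frames, dims) feature matrix left by k frames.

    Frame t of the result holds the original frame t + k, so each step
    sees k frames of "future" accompaniment. The tail is zero-padded
    (an assumption; the source does not say how the tail is filled).
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    shifted = np.zeros_like(features)
    n = len(features)
    if k < n:
        shifted[: n - k] = features[k:]
    return shifted


def upsample_tokens(tokens: np.ndarray, n_frames: int) -> np.ndarray:
    """Stretch a token sequence to n_frames by nearest-neighbour repetition.

    Discrete token ids cannot be averaged, so each output frame simply
    copies the token that covers its position (an assumed scheme).
    """
    idx = np.arange(n_frames) * len(tokens) // n_frames
    return tokens[idx]


accompaniment = np.arange(10).reshape(5, 2)  # 5 frames, 2 dims (toy data)
shifted = shift_left(accompaniment, k=2)

semantic_tokens = np.array([7, 8, 9])        # toy token ids
frame_tokens = upsample_tokens(semantic_tokens, n_frames=6)
```

In this toy run, `shifted` starts with the third original frame and ends with two zero frames, and `frame_tokens` repeats each token twice to cover six spectrogram frames.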