Syllable-level lyrics generation from melody exploiting character-level language model

Zhe Zhang, Karol Lasocki, Yi Yu, Atsuhiro Takasu

TL;DR

This work tackles syllable-level lyrics generation conditioned on melody by combining a melody-encoder, syllable-decoder Transformer with a fine-tuned character-level language model (CANINE) that re-ranks beam-search candidates. A dataset of over 2 million syllable-continuation examples is constructed from yu_conditional_2021 for NSP-style (next-sentence-prediction-style) fine-tuning, including challenging negatives to improve robustness. During beam search, candidates are scored with a weighted combination of Transformer and language-model probabilities, accumulated over the sequence to maintain long-range coherence. The approach is evaluated with objective metrics, ChatGPT-based assessments, and human judgments, which show improved coherence and correctness while maintaining musical alignment. The study shows that grounding lyric generation in a pre-trained character-level LM can improve quality without training new large models, and it discusses limitations and future directions such as larger syllable-level corpora, end-to-end training, and alternative pre-trained models.
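To make the idea of syllable-continuation examples concrete, here is a minimal Python sketch of how NSP-style pairs might be built from one syllable sequence. It is an illustration under stated assumptions, not the authors' pipeline: the function name `build_continuation_examples`, the space-joined context, and the negative-sampling rule (pick any other syllable from a supplied vocabulary) are all hypothetical. The summary above does not say how the challenging negatives were chosen.

```python
import random
from typing import List, Tuple

# (context text, candidate next syllable, label): label 1 = true continuation.
Example = Tuple[str, str, int]


def build_continuation_examples(
    syllables: List[str], vocab: List[str], seed: int = 0
) -> List[Example]:
    """Build positive/negative continuation pairs from one lyric.

    Assumption: negatives are drawn uniformly from `vocab`, excluding the
    true next syllable. The real negative-selection rule is not given in
    the source text.
    """
    rng = random.Random(seed)
    examples: List[Example] = []
    for i in range(1, len(syllables)):
        context = " ".join(syllables[:i])  # assumed serialization of the prefix
        true_next = syllables[i]
        examples.append((context, true_next, 1))
        wrong = [s for s in vocab if s != true_next]
        if wrong:
            examples.append((context, rng.choice(wrong), 0))
    return examples


lyric = ["twin", "kle", "lit", "tle", "star"]
pairs = build_continuation_examples(lyric, vocab=lyric)
```

Each prefix of the lyric yields one positive pair and, when the vocabulary allows it, one negative pair, which is the binary format a next-sentence-prediction head expects.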

Abstract

The generation of lyrics tightly connected to accompanying melodies involves establishing a mapping between musical notes and syllables of lyrics. This process requires a deep understanding of musical constraints and of semantic patterns at the syllable, word, and sentence levels. However, pre-trained language models designed specifically for the syllable level are not publicly available. To address these issues, we propose fine-tuning character-level language models for syllable-level lyrics generation from symbolic melody. In particular, our method incorporates the linguistic knowledge of the language model into the beam search process of a syllable-level Transformer generator network. Additionally, by exploring ChatGPT-based evaluation of generated lyrics alongside human subjective evaluation, we demonstrate that our approach enhances the coherence and correctness of the generated lyrics without the need to train expensive new language models.
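The beam-search integration described above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: the weight `lam`, the choice to interpolate log-probabilities rather than raw probabilities, and the stand-in scoring functions `toy_gen` and `toy_lm` are all assumptions made for the example. In the real system the generator would be the melody-conditioned Transformer and the language-model score would come from the fine-tuned character-level model.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[List[str], float]  # (syllables so far, cumulative score)
EPS = 1e-12  # guards against log(0)


def rescored_beam_search(
    gen_probs: Callable[[List[str]], Dict[str, float]],
    lm_prob: Callable[[List[str], str], float],
    steps: int,
    beam_width: int = 2,
    lam: float = 0.5,
) -> List[Hypothesis]:
    """Beam search whose step score mixes generator and LM log-probabilities.

    Assumption: the mix is (1 - lam) * log p_gen + lam * log p_lm, and the
    hypothesis score is the running sum of step scores.
    """
    beams: List[Hypothesis] = [([], 0.0)]
    for _ in range(steps):
        candidates: List[Hypothesis] = []
        for prefix, score in beams:
            for syl, p_gen in gen_probs(prefix).items():
                p_lm = lm_prob(prefix, syl)
                step = (1.0 - lam) * math.log(max(p_gen, EPS)) \
                    + lam * math.log(max(p_lm, EPS))
                candidates.append((prefix + [syl], score + step))
        # Keep the top hypotheses by cumulative score.
        candidates.sort(key=lambda h: h[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


def toy_gen(prefix: List[str]) -> Dict[str, float]:
    # Hypothetical generator that cannot tell the two syllables apart.
    return {"hel": 0.5, "lo": 0.5}


def toy_lm(prefix: List[str], syl: str) -> float:
    # Hypothetical LM that prefers "hel" first and "lo" right after "hel".
    if not prefix:
        return 0.9 if syl == "hel" else 0.1
    if prefix[-1] == "hel":
        return 0.9 if syl == "lo" else 0.1
    return 0.5


beams = rescored_beam_search(toy_gen, toy_lm, steps=2, beam_width=2, lam=0.5)
best_syllables, best_score = beams[0]
```

In this toy setup the generator alone is indifferent between candidates, so the language-model term decides the ranking. Because scores accumulate across steps, an early implausible syllable continues to penalize a hypothesis later in the sequence.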

Paper Structure (16 sections, 5 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Transformer-based melody-encoder-syllable-decoder architecture exploiting character-level language model.
  • Figure 2: Generated sheet music.
  • Figure 3: Correlation between ChatGPT-based evaluation and human evaluation of generated lyrics.
  • Figure 4: Results of subjective evaluation of lyrics generation from melody.