LyCon: Lyrics Reconstruction from the Bag-of-Words Using Large Language Models
Haven Kim, Kahyun Choi
TL;DR
This work tackles copyright barriers in lyric research by reconstructing full lyrics from publicly available Bag-of-Words (BoW) datasets using large language models. By combining BoW vocabularies with metadata (title and artist for topic, AllMusic genre, and Deezer mood), the authors prompt a language model to produce lyrics that match the target genre, mood, and vocabulary while preserving stylistic features. They introduce LyCon, a dataset of 7,863 reconstructed lyric sets aligned with Million Song Dataset (MSD) metadata, which enables mood- and genre-conditioned lyric generation and lyric-based analyses. Overall, LyCon offers a copyright-safe pathway for lyric research and for downstream tasks such as mood-conditioned generation and music taxonomy, supported by quantitative analyses showing structural and aesthetic similarity to the original lyrics.
Abstract
This paper addresses a challenge particular to lyric studies: direct use of lyrics is often restricted by copyright. Unlike many other kinds of text data, lyrics sourced from the internet are frequently protected under copyright law, which necessitates alternative approaches. Our study introduces a novel method for generating copyright-free lyrics from publicly available Bag-of-Words (BoW) datasets, which contain the vocabulary of lyrics but not the lyrics themselves. Using large language models together with the metadata associated with the BoW datasets, we reconstruct lyrics from this vocabulary. We have compiled and publicly released LyCon, a dataset of reconstructed lyrics aligned with metadata from established sources including the Million Song Dataset, the Deezer Mood Detection Dataset, and the AllMusic Genre Dataset. We believe that integrating metadata such as mood annotations and genres enables a variety of academic experiments on lyrics, such as conditional lyric generation.
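To make the reconstruction idea concrete, the following is a minimal sketch of how a prompt might be assembled from a BoW entry and its metadata, plus a simple check of how much of the BoW vocabulary a generated lyric actually uses. It is not the authors' implementation: the function names `build_prompt` and `vocabulary_coverage`, the prompt wording, the plain-string mood label, and the example values are all assumptions made for illustration. The call to a language model is left out on purpose, since the summary above does not specify one.

```python
import re
from collections import Counter


def build_prompt(title, artist, genre, mood, bow):
    """Assemble a text prompt from metadata and a bag-of-words (word -> count).

    Hypothetical helper: the prompt template here is an assumption,
    not the template used to build LyCon.
    """
    # Most frequent words first; ties are broken alphabetically.
    ordered = sorted(bow.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = ", ".join(f"{word} ({count})" for word, count in ordered)
    return (
        f'Write song lyrics titled "{title}" in the style of {artist}.\n'
        f"Genre: {genre}\n"
        f"Mood: {mood}\n"
        f"Use these words (approximate counts in parentheses): {vocab}\n"
    )


def vocabulary_coverage(lyrics, bow):
    """Return the fraction of BoW word types that occur in the lyrics."""
    if not bow:
        return 0.0
    # Simple lowercase tokenization; a real pipeline would need to match
    # whatever normalization the BoW dataset applied to its words.
    tokens = Counter(re.findall(r"[a-z']+", lyrics.lower()))
    present = sum(1 for word in bow if tokens[word] > 0)
    return present / len(bow)


# Illustrative values only; these are not taken from any dataset.
example_bow = {"love": 3, "night": 2, "run": 1}
prompt = build_prompt("Example Song", "Example Artist", "Rock", "happy", example_bow)
coverage = vocabulary_coverage("Love in the night", example_bow)
```

A coverage score of this kind is one simple way to judge whether generated text respects the target vocabulary; it says nothing about genre, mood, or stylistic fit, which would need separate checks.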
