Long-Form Text-to-Music Generation with Adaptive Prompts: A Case Study in Tabletop Role-Playing Games Soundtracks
Felipe Marra, Lucas N. Ferreira
TL;DR
The paper addresses long-form, time-varying soundtrack generation for TRPGs using text-to-audio models. It introduces Babel Bardo, an LLM-driven pipeline that converts speech transcripts into music descriptions every 30 seconds; these descriptions condition a text-to-music model to produce continuous audio segments. Four variants are evaluated across two campaigns in English and Portuguese. Key findings show that detailed music descriptions improve audio quality and that maintaining consistency across consecutive descriptions yields smoother transitions, with emotion-based signals delivering the strongest narrative alignment. The approach demonstrates the potential of adaptive prompts for coherent, story-driven soundtracks and suggests future work on long-term consistency and user-based evaluations.
Abstract
This paper investigates the capabilities of text-to-audio music generation models in producing long-form music with prompts that change over time, focusing on soundtrack generation for Tabletop Role-Playing Games (TRPGs). We introduce Babel Bardo, a system that uses Large Language Models (LLMs) to transform speech transcriptions into music descriptions for controlling a text-to-music model. Four versions of Babel Bardo were compared in two TRPG campaigns: a baseline using direct speech transcriptions, and three LLM-based versions with varying approaches to music description generation. Evaluations considered audio quality, story alignment, and transition smoothness. Results indicate that detailed music descriptions improve audio quality, while maintaining consistency across consecutive descriptions enhances story alignment and transition smoothness.
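The control loop described above (transcript window, then music description, then audio segment) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `window_transcript`, `run_pipeline`, `describe`, and `generate` are hypothetical: `describe` stands in for the LLM step and `generate` for the text-to-music model, and passing the previous description to `describe` is only one assumed way to encourage consistency between consecutive prompts. The 30-second window comes from the summary above.

```python
from typing import Any, Callable, List, Optional, Tuple

WINDOW_SECONDS = 30.0  # prompt update interval stated in the summary

# (start time in seconds, transcribed text); a hypothetical input format
Utterance = Tuple[float, str]


def window_transcript(
    utterances: List[Utterance], window: float = WINDOW_SECONDS
) -> List[str]:
    """Group utterances into consecutive fixed-length windows by start time."""
    if not utterances:
        return []
    n_windows = int(max(start for start, _ in utterances) // window) + 1
    buckets: List[List[str]] = [[] for _ in range(n_windows)]
    for start, text in sorted(utterances):
        buckets[int(start // window)].append(text)
    # A window with no speech becomes an empty string.
    return [" ".join(bucket) for bucket in buckets]


def run_pipeline(
    utterances: List[Utterance],
    describe: Callable[[str, Optional[str]], str],  # stand-in for the LLM
    generate: Callable[[str], Any],  # stand-in for the text-to-music model
) -> Tuple[List[str], List[Any]]:
    """Produce one music description and one audio segment per window."""
    prompts: List[str] = []
    segments: List[Any] = []
    previous: Optional[str] = None
    for text in window_transcript(utterances):
        # Assumption: the previous description is fed back so that
        # consecutive prompts can stay consistent.
        prompt = describe(text, previous)
        segments.append(generate(prompt))
        prompts.append(prompt)
        previous = prompt
    return prompts, segments
```

In this sketch, the baseline that conditions the music model on raw transcriptions corresponds to a `describe` that returns its first argument unchanged, while the LLM-based variants would replace it with a call to a language model.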
