Long-Form Text-to-Music Generation with Adaptive Prompts: A Case Study in Tabletop Role-Playing Games Soundtracks

Felipe Marra, Lucas N. Ferreira

TL;DR

The paper addresses long-form, time-varying soundtrack generation for TRPGs using text-to-audio models. It introduces Babel Bardo, an LLM-driven pipeline that converts speech transcripts into music descriptions every 30 seconds; each description conditions a text-to-music model, and the resulting segments form a continuous soundtrack. Four variants are evaluated across two campaigns, in English and Portuguese. Key findings show that detailed music descriptions improve audio quality and that maintaining consistency across consecutive descriptions yields smoother transitions, with emotion-based signals delivering the strongest narrative alignment. The approach demonstrates the potential of adaptive prompts for coherent, story-driven soundtracks and suggests future work on long-term consistency and user-based evaluations.

Abstract

This paper investigates the capabilities of text-to-audio music generation models in producing long-form music with prompts that change over time, focusing on soundtrack generation for Tabletop Role-Playing Games (TRPGs). We introduce Babel Bardo, a system that uses Large Language Models (LLMs) to transform speech transcriptions into music descriptions for controlling a text-to-music model. Four versions of Babel Bardo were compared in two TRPG campaigns: a baseline using direct speech transcriptions, and three LLM-based versions with varying approaches to music description generation. Evaluations considered audio quality, story alignment, and transition smoothness. Results indicate that detailed music descriptions improve audio quality, while maintaining consistency across consecutive descriptions enhances story alignment and transition smoothness.
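
The loop described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the three callables (`transcribe`, `describe`, `synthesize`) are hypothetical placeholders for a speech recognizer, an LLM prompt, and a text-to-music model, and passing earlier descriptions to `describe` is an assumption about how consistency between consecutive descriptions might be encouraged.

```python
from typing import Callable, List

SEGMENT_SECONDS = 30  # segment length reported in the summary above


def generate_soundtrack(
    speech_chunks: List[bytes],
    transcribe: Callable[[bytes], str],            # hypothetical speech recognizer
    describe: Callable[[str, List[str]], str],     # hypothetical LLM wrapper
    synthesize: Callable[[str], List[float]],      # hypothetical text-to-music model
) -> List[List[float]]:
    """For each 30-second speech chunk, map s_i -> d_i -> a_i."""
    descriptions: List[str] = []
    segments: List[List[float]] = []
    for chunk in speech_chunks:
        s_i = transcribe(chunk)
        # Assumption: earlier descriptions are passed along so the LLM
        # can keep consecutive descriptions consistent.
        d_i = describe(s_i, descriptions)
        a_i = synthesize(d_i)
        descriptions.append(d_i)
        segments.append(a_i)
    return segments
```

In this sketch the baseline variant would correspond to a `describe` function that simply returns the transcript unchanged, so the text-to-music model is conditioned on raw speech.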

Paper Structure

This paper contains 7 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Every 30 seconds of gameplay, Babel Bardo transcribes the players' speech into a text $s_i$ using a speech recognition system and uses a Large Language Model (LLM) to map $s_i$ into a music description $d_i$ that matches the scene described by the players. This music description is given to a text-to-music system that generates a 30-second piece $a_i$ directly in the audio domain.
  • Figure 2: The transition KLD is computed between the 10 seconds before and after every transition moment $t_i$.
  • Figure :
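
The transition measurement in Figure 2 can be illustrated with a small sketch. The caption only states that a KLD is computed between the 10 seconds before and after each transition moment $t_i$; it does not say which distributions are compared. The code below therefore assumes each window has already been summarized as a probability distribution over some fixed set of classes (for example, the output of an audio classifier), and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Window = Tuple[float, float]  # (start_second, end_second)


def transition_windows(
    total_seconds: float,
    segment_seconds: float = 30.0,
    window_seconds: float = 10.0,
) -> List[Tuple[Window, Window]]:
    """Return the (before, after) windows around every transition t_i."""
    windows: List[Tuple[Window, Window]] = []
    t = segment_seconds
    while t < total_seconds:
        before = (max(0.0, t - window_seconds), t)
        after = (t, min(total_seconds, t + window_seconds))
        windows.append((before, after))
        t += segment_seconds
    return windows


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two non-negative vectors, normalized to sum to 1."""
    p_sum, q_sum = sum(p), sum(q)
    total = 0.0
    for p_k, q_k in zip(p, q):
        p_k, q_k = p_k / p_sum, q_k / q_sum
        if p_k > 0.0:
            # eps guards against division by zero when q has empty classes
            total += p_k * math.log(p_k / max(q_k, eps))
    return total
```

Under these assumptions, a soundtrack of three 30-second segments has two transitions, and a lower divergence between the "before" and "after" distributions would indicate a smoother transition.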