Expressive Range Characterization of Open Text-to-Audio Models

Jonathan Morse, Azadeh Naderi, Swen Gaudl, Mark Cartwright, Amy K. Hoover, Mark J. Nelson

TL;DR

The paper addresses the challenge of understanding the diversity and fidelity of outputs from text-to-audio models. It adapts expressive range analysis (ERA) from procedural content generation to evaluate model outputs conditioned on fixed prompts, using both per-prompt metrics (e.g., thunderclap timing) and general acoustic-feature analyses via PCA. By prompting three open-source models with ESC-50-based labels and analyzing pitch, loudness, and timbre, the work demonstrates a practical framework for exploratory evaluation and highlights model-specific variation patterns. This ERA-based approach offers game designers and researchers a quantitative, prompt-aware tool to compare and understand the expressive capacity of text-to-audio synthesis systems.

Abstract

Text-to-audio models are a type of generative model that produces audio output in response to a given textual prompt. Although level generators and the properties of the functional content that they create (e.g., playability) dominate most discourse in procedurally generated content (PCG), games that emotionally resonate with players tend to weave together a range of creative and multimodal content (e.g., music, sounds, visuals, narrative tone), and multimodal models have begun seeing at least experimental use for this purpose. However, it remains unclear what exactly such models generate, and with what degree of variability and fidelity: audio is an extremely broad class of output for a generative system to target. Within the PCG community, expressive range analysis (ERA) has been used as a quantitative way to characterize generators' output space, especially for level generators. This paper adapts ERA to text-to-audio models, making the analysis tractable by looking at the expressive range of outputs for specific, fixed prompts. Experiments are conducted by prompting the models with several standardized prompts derived from the Environmental Sound Classification (ESC-50) dataset. The resulting audio is analyzed along key acoustic dimensions (e.g., pitch, loudness, and timbre). More broadly, this paper offers a framework for ERA-based exploratory evaluation of generative audio models.

Paper Structure

This paper contains 63 sections, 24 figures, 1 table.

Figures (24)

  • Figure 1: Querying a Text-to-Audio Model. Text-to-audio models operate by taking text prompts as input, mapping them to a joint embedding space that relates text and audio, and generating audio clips as output. This schematic illustrates generating 100 audio outputs each for the prompts "cat meowing plaintively" and "ambient music". The audio for this and all other figures in the paper is available at https://doi.org/10.5281/zenodo.16998750.
  • Figure 2: Spectrograms of Audio Generated with 'Thunder' Prompt. Spectrograms of five 10-second output samples for the single-word prompt thunder. Stable Audio Open in the top row; MMAudio in the bottom row.
  • Figure 3: Loudness of Audio Generated with 'Thunder' Prompt. RMS loudness vs. time plots for the same samples that were shown as spectrograms in Figure 2. Stable Audio Open in the top row; MMAudio in the bottom row.
  • Figure 4: Expressive Range of Thunderclap Timing and Magnitude for the 'Thunder' Prompt. Relative magnitude vs. timing of the RMS loudness peak for 100 samples generated by each model for the prompt thunder. Relative magnitude is defined as peak loudness divided by average loudness across the 10-second clip. From this, we can see that Stable Audio, but not MMAudio, tends to generate audio with a distinct thunderclap in the first few seconds.
  • Figure 5: crying baby
  • ...and 19 more figures
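
The per-prompt metric described in the Figure 4 caption (timing of the RMS loudness peak, and relative magnitude defined as peak loudness divided by average loudness over the clip) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the frame length of 2048 samples, and the hop of 512 samples are assumptions chosen for the example, and the paper's actual framing parameters may differ.

```python
import numpy as np


def rms_envelope(signal, frame_length=2048, hop_length=512):
    """Frame-wise RMS loudness of a mono signal (frame sizes are assumed values)."""
    signal = np.asarray(signal, dtype=np.float64)
    n_frames = 1 + max(0, (len(signal) - frame_length) // hop_length)
    rms = np.empty(n_frames)
    for i in range(n_frames):
        start = i * hop_length
        frame = signal[start:start + frame_length]
        rms[i] = np.sqrt(np.mean(frame ** 2))
    return rms


def peak_timing_and_relative_magnitude(signal, sample_rate,
                                       frame_length=2048, hop_length=512):
    """Return (time of the RMS peak in seconds, peak RMS / mean RMS)."""
    rms = rms_envelope(signal, frame_length, hop_length)
    peak_idx = int(np.argmax(rms))
    # Report the time at the center of the loudest frame.
    peak_time = (peak_idx * hop_length + frame_length / 2) / sample_rate
    mean_rms = rms.mean()
    # A silent clip has no meaningful ratio, so return NaN in that case.
    relative = rms[peak_idx] / mean_rms if mean_rms > 0 else float("nan")
    return peak_time, relative


# Synthetic stand-in for a generated clip: 10 s of quiet tone with a
# loud half-second burst starting at 2.0 s (not real model output).
sample_rate = 8000
t = np.arange(10 * sample_rate) / sample_rate
clip = 0.01 * np.sin(2 * np.pi * 220 * t)
burst = (t >= 2.0) & (t < 2.5)
clip[burst] = np.sin(2 * np.pi * 220 * t[burst])

peak_time, relative = peak_timing_and_relative_magnitude(clip, sample_rate)
print(f"peak at {peak_time:.2f} s, relative magnitude {relative:.1f}")
```

Computing this pair of values for each of the 100 samples per model and plotting one against the other would give a scatter of the kind Figure 4 describes, where a cluster of high relative magnitudes at early peak times corresponds to a distinct thunderclap in the first few seconds.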