RiTTA: Modeling Event Relations in Text-to-Audio Generation

Yuhang He, Yash Jain, Xubo Liu, Andrew Markham, Vibhav Vineet

TL;DR

RiTTA addresses a notable gap in Text-to-Audio (TTA) generation: modeling the relations between audio events described in text. It introduces a benchmark comprising an audio event relation corpus, an audio event category corpus, and a seed-audio-based <textprompt,audio> pair generation pipeline, along with a multi-stage MSR-RiTTA evaluation framework that assesses presence, relational correctness, and parsimony. The authors show that state-of-the-art TTA models struggle to capture audio event relations, and that finetuning Tango with a targeted approach improves relation modeling even with limited data. The work provides a practical foundation for evaluating and improving relational reasoning in TTA systems, with implications for realistic acoustic scene generation and downstream applications in VR, cinema, and immersive media.

Abstract

Although Text-to-Audio (TTA) generation models have advanced significantly, achieving high-fidelity audio with fine-grained context understanding, they struggle to model the relations between audio events described in the input text. Previous TTA methods have neither systematically explored audio event relation modeling nor proposed frameworks to enhance this capability. In this work, we systematically study audio event relation modeling in TTA generation models. We first establish a benchmark for this task by: 1. proposing a comprehensive relation corpus covering all potential relations in real-world scenarios; 2. introducing a new audio event corpus encompassing commonly heard audio events; and 3. proposing new evaluation metrics to assess audio event relation modeling from various perspectives. Furthermore, we propose a finetuning framework to enhance existing TTA models' ability to model audio event relations. Code is available at: https://github.com/yuhanghe01/RiTTA

Paper Structure

This paper contains 19 sections, 5 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: RiTTA Motivation: The acoustic world is rich with diverse audio events that exhibit various relationships. While text can precisely describe these relationships (Fig. A), current TTA models struggle to capture both the audio events and the relations conveyed by the text (Fig. B). This challenge motivates us to systematically study RiTTA.
  • Figure 2: Audio Events Relation Corpus.
  • Figure 2: GPT-4 augmented prompts (before relation).
  • Figure 3: Relation-aware <textprompt,audio> pair creation pipeline. It introduces large diversity in both the text prompts and the audio.
  • Figure 4: Relation-aware evaluation. An audio event detection model is applied to extract audio events. The metadata of each event contains start time $t_1$, end time $t_2$, confidence score $s$, and class label $c$. Various relations can be derived from these audio events.
  • ...and 4 more figures
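
The Figure 4 caption describes each detected audio event by its start time $t_1$, end time $t_2$, confidence score $s$, and class label $c$. As a rough illustration of how a temporal relation could be read off such metadata, the Python sketch below checks a "before" relation between two event classes. It is not the paper's evaluation code: the `DetectedEvent` class, the `happens_before` helper, the example labels, and the 0.5 confidence threshold are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DetectedEvent:
    """Hypothetical container mirroring the metadata named in the Figure 4 caption."""
    t1: float  # start time (seconds)
    t2: float  # end time (seconds)
    s: float   # detector confidence score
    c: str     # class label


def happens_before(events: List[DetectedEvent], first: str, second: str,
                   min_score: float = 0.5) -> bool:
    """Return True if some confident `first` event ends no later than
    some confident `second` event starts.

    The threshold `min_score` is an assumption of this sketch, not a
    value taken from the paper.
    """
    confident = [e for e in events if e.s >= min_score]
    firsts = [e for e in confident if e.c == first]
    seconds = [e for e in confident if e.c == second]
    return any(a.t2 <= b.t1 for a in firsts for b in seconds)


# Illustrative detections (made-up values, not results from the paper).
events = [
    DetectedEvent(t1=0.0, t2=1.5, s=0.9, c="dog_bark"),
    DetectedEvent(t1=2.0, t2=3.0, s=0.8, c="cat_meow"),
    DetectedEvent(t1=4.0, t2=5.0, s=0.2, c="dog_bark"),  # below threshold, ignored
]

print(happens_before(events, "dog_bark", "cat_meow"))
print(happens_before(events, "cat_meow", "dog_bark"))
```

Filtering by confidence before comparing timestamps keeps a spurious low-score detection from satisfying the relation by accident; other relations could be checked with similar comparisons on $t_1$ and $t_2$.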