RiTTA: Modeling Event Relations in Text-to-Audio Generation
Yuhang He, Yash Jain, Xubo Liu, Andrew Markham, Vibhav Vineet
TL;DR
RiTTA addresses a notable gap in Text-to-Audio (TTA) generation: modeling the relations between audio events described in text. It introduces a comprehensive benchmark comprising an audio event relation corpus, an audio event category corpus, and a seed-audio-based <text prompt, audio> pair generation pipeline, along with a multi-stage MSR-RiTTA evaluation framework that assesses presence, relational correctness, and parsimony. The authors show that state-of-the-art TTA models struggle to capture audio event relations and that a targeted finetuning approach applied to Tango markedly improves relation modeling, even with limited data. This work provides a practical foundation for evaluating and improving relational reasoning in TTA systems, with implications for realistic acoustic scene generation and downstream applications in VR, cinema, and immersive media.
Abstract
Although Text-to-Audio (TTA) generation models now achieve high-fidelity audio with fine-grained context understanding, they struggle to model the relations between audio events described in the input text. Previous TTA methods have neither systematically explored audio event relation modeling nor proposed frameworks to enhance this capability. In this work, we systematically study audio event relation modeling in TTA generation models. We first establish a benchmark for this task by: 1. proposing a comprehensive relation corpus covering all potential relations in real-world scenarios; 2. introducing a new audio event corpus encompassing commonly heard audio events; and 3. proposing new evaluation metrics to assess audio event relation modeling from various perspectives. Furthermore, we propose a finetuning framework to enhance existing TTA models' ability to model audio event relations. Code is available at: https://github.com/yuhanghe01/RiTTA
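
To make the three evaluation aspects named in the TL;DR (presence, relational correctness, parsimony) more concrete, the following is a toy sketch in Python. It is not the MSR-RiTTA metric from the paper or the code in the repository. It assumes a hypothetical upstream detector has already turned the generated audio into labeled, time-stamped events, and it checks a single hypothetical temporal relation ("A before B"). All names (`DetectedEvent`, `presence_score`, `parsimony_score`, `before_relation_holds`), the event labels, and the scoring rules are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectedEvent:
    """Hypothetical output of an audio event detector run on generated audio."""
    label: str
    start: float  # onset in seconds
    end: float    # offset in seconds


def presence_score(targets, detected):
    """Fraction of the target events that appear at least once (assumed rule)."""
    found = {e.label for e in detected}
    return sum(t in found for t in targets) / len(targets)


def parsimony_score(targets, detected):
    """Fraction of detected events that were actually asked for (assumed rule).

    Extra, unrequested events lower the score.
    """
    if not detected:
        return 0.0
    wanted = set(targets)
    return sum(e.label in wanted for e in detected) / len(detected)


def before_relation_holds(first, second, detected):
    """True if every `first` event ends before any `second` event starts."""
    firsts = [e for e in detected if e.label == first]
    seconds = [e for e in detected if e.label == second]
    if not firsts or not seconds:
        return False  # the relation cannot hold if either event is missing
    return max(e.end for e in firsts) <= min(e.start for e in seconds)


# Hypothetical case: the prompt asks for a dog bark followed by a cat meow,
# but the generated audio also contains an unrequested car horn.
targets = ["dog bark", "cat meow"]
detected = [
    DetectedEvent("dog bark", 0.0, 1.5),
    DetectedEvent("cat meow", 2.0, 3.0),
    DetectedEvent("car horn", 3.5, 4.0),
]

presence = presence_score(targets, detected)
parsimony = parsimony_score(targets, detected)
relation_ok = before_relation_holds("dog bark", "cat meow", detected)
```

In this toy case both requested events are present and correctly ordered, while the extra car horn lowers only the parsimony score. This shows why the three aspects are worth reporting separately.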
