Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization
Navonil Majumder, Chia-Yu Hung, Deepanway Ghosal, Wei-Ning Hsu, Rada Mihalcea, Soujanya Poria
TL;DR
The paper tackles semantic and temporal alignment in diffusion-based text-to-audio generation by creating Audio-alpaca, a large, automatically generated preference dataset, and fine-tuning Tango on it with direct preference optimization (DPO). The authors perturb prompts, rank the resulting audio with CLAP to form winner/loser pairs, and prune noisy pairs. The resulting model, Tango 2, outperforms Tango and AudioLDM2 on both objective metrics (FAD, KL, IS, CLAP) and subjective assessments. A key finding is that temporal augmentation and contrastive learning via DPO help the model map prompts to temporally coherent audio, even without new out-of-distribution prompts. The work provides a scalable alignment framework for text-to-audio, offers Audio-alpaca as a releasable resource, and demonstrates the practical impact of diffusion-based DPO in audio generation.
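The ranking-and-pruning step described above can be illustrated with a minimal sketch. It assumes that each candidate audio clip already has a text-audio similarity score (for example, from CLAP); the `Candidate` class, the function name, and both thresholds are hypothetical choices for illustration, not the values or code used by the authors.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    audio_id: str      # hypothetical identifier of a generated audio clip
    clap_score: float  # assumed precomputed text-audio similarity score


def build_preference_pairs(prompt, candidates,
                           min_winner_score=0.45, min_margin=0.08):
    """Return (prompt, winner_id, loser_id) triples for one prompt.

    Both thresholds are illustrative assumptions. The prompt is pruned
    (empty list) if its best candidate scores too low, and a loser is kept
    only if it trails the winner by at least `min_margin`.
    """
    if len(candidates) < 2:
        return []
    # Rank candidates from most to least similar to the prompt.
    ranked = sorted(candidates, key=lambda c: c.clap_score, reverse=True)
    winner = ranked[0]
    if winner.clap_score < min_winner_score:
        return []  # even the best audio matches the prompt poorly
    pairs = []
    for loser in ranked[1:]:
        # Skip near-ties: they give a noisy preference signal.
        if winner.clap_score - loser.clap_score >= min_margin:
            pairs.append((prompt, winner.audio_id, loser.audio_id))
    return pairs


example = build_preference_pairs(
    "a dog barks, then a car passes",
    [Candidate("a", 0.62), Candidate("b", 0.50), Candidate("c", 0.58)],
)
```

In this example only the pair with winner `a` and loser `b` survives, because `c` scores too close to the winner to count as a clear loser.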
Abstract
Generative multimodal content is increasingly prevalent in much of the content creation arena, as it has the potential to allow artists and media personnel to create pre-production mockups by quickly bringing their ideas to life. The generation of audio from text prompts is an important aspect of such processes in the music and film industry. Many recent diffusion-based text-to-audio models focus on training increasingly sophisticated diffusion models on large datasets of prompt-audio pairs. These models do not explicitly focus on the presence of concepts or events, or on their temporal ordering, in the output audio with respect to the input prompt. Our hypothesis is that focusing on these aspects of audio generation could improve performance in the presence of limited data. As such, in this work, using an existing text-to-audio model, Tango, we synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from. The loser outputs, in theory, have some concepts from the prompt missing or in an incorrect order. We fine-tune the publicly available Tango text-to-audio model using the diffusion-DPO (direct preference optimization) loss on our preference dataset and show that it leads to improved audio output over Tango and AudioLDM2, in terms of both automatic and manual evaluation metrics.
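To make the training objective more concrete, here is a minimal sketch of the general form of a diffusion-DPO loss for a single winner/loser pair. It assumes the four squared noise-prediction errors (fine-tuned model and frozen reference model, on the noised winner and the noised loser) have already been computed at a shared timestep; the function names and the default `beta` are hypothetical, and the exact weighting used for Tango 2 may differ.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def diffusion_dpo_loss(err_w_policy, err_w_ref, err_l_policy, err_l_ref,
                       beta=1.0):
    """Sketch of a diffusion-DPO loss for one preference pair.

    Each `err_*` is a squared noise-prediction error
    ||eps - eps_model(x_t, t)||^2, assumed to be precomputed:
      *_w_* on the noised winner audio, *_l_* on the noised loser audio,
      *_policy from the model being fine-tuned, *_ref from the frozen
      reference model. `beta` is an illustrative scale, not the paper's value.
    """
    # How much the policy improves over the reference on the winner,
    # minus how much it improves on the loser (lower error is better).
    diff = (err_w_policy - err_w_ref) - (err_l_policy - err_l_ref)
    logit = -beta * diff
    # -log(sigmoid(logit)) == softplus(-logit)
    return softplus(-logit)


# With the policy identical to the reference, the loss equals log(2).
baseline = diffusion_dpo_loss(0.5, 0.5, 0.7, 0.7)
# Denoising the winner better than the reference lowers the loss.
improved = diffusion_dpo_loss(0.3, 0.5, 0.7, 0.7, beta=5.0)
```

The loss falls when the fine-tuned model denoises the winner better than the reference does, relative to how it treats the loser, which is the contrastive signal the preference dataset is meant to supply.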
