Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization

Navonil Majumder, Chia-Yu Hung, Deepanway Ghosal, Wei-Ning Hsu, Rada Mihalcea, Soujanya Poria

TL;DR

The paper tackles semantic and temporal alignment in diffusion-based text-to-audio generation by creating Audio-alpaca, a large automatically generated preference dataset, and applying direct preference optimization (DPO) to fine-tune Tango on this data. By perturbing prompts and ranking outputs with CLAP, the authors generate winner/loser audio pairs and prune noisy data. The resulting model, Tango 2, outperforms Tango and AudioLDM2 on both objective metrics (FAD, KL, IS, CLAP) and subjective assessments. A key finding is that temporal augmentation and contrastive learning via DPO help map prompts to temporally coherent audio, even without new out-of-distribution prompts. The work provides a scalable alignment framework for text-to-audio, offers Audio-alpaca as a releasable resource, and demonstrates the practical impact of diffusion-based DPO in audio generation.

Abstract

Generative multimodal content is increasingly prevalent in much of the content creation arena, as it has the potential to allow artists and media personnel to create pre-production mockups by quickly bringing their ideas to life. The generation of audio from text prompts is an important aspect of such processes in the music and film industry. Many of the recent diffusion-based text-to-audio models focus on training increasingly sophisticated diffusion models on a large set of datasets of prompt-audio pairs. These models do not explicitly focus on the presence of concepts or events and their temporal ordering in the output audio with respect to the input prompt. Our hypothesis is that focusing on these aspects of audio generation could improve performance in the presence of limited data. As such, in this work, using an existing text-to-audio model, Tango, we synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from. The loser outputs, in theory, have some concepts from the prompt missing or in an incorrect order. We fine-tune the publicly available Tango text-to-audio model using the diffusion-DPO (direct preference optimization) loss on our preference dataset and show that it leads to improved audio output over Tango and AudioLDM2, in terms of both automatic and manual evaluation metrics.

Paper Structure (35 sections, 12 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: An illustration of our pipeline for text-to-audio alignment. The top part depicts the preference dataset creation, where three strategies are deployed to generate the undesirable audio outputs for the input prompts. These samples are further filtered to form Audio-alpaca. This preference dataset is finally used to align Tango using the DPO-diffusion loss (\ref{eq:DPO-Diff}), resulting in Tango 2.
  • Figure 2: The distribution of $\alpha_1$ and $\Delta_1$ scores in the unfiltered dataset. We see that for an unfiltered dataset: i) the winner audio sample is not always strongly aligned to the text prompt in the $\alpha_1$ plot; ii) winner and loser audio samples can be too close in the $\Delta_1$ plot. We thus choose the values of $\alpha_1$, $\Delta_1$ and the other selection parameters to ensure that the filtered dataset is less noisy, with more separation between the winner and loser audio samples.
  • Figure 3: CLAP score of the models vs the number of events or concepts in the textual prompt.
  • Figure 4: The impact of filtering Audio-alpaca on performance, observed through $\Delta_2$ and $\alpha_2$. The CLAP score of the winning audio must be at least $\alpha_2$, and $\Delta_2$ represents the difference in CLAP scores between the winning audio $x^w$ and the losing audio $x^l$ given a prompt $\tau$.
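
The filtering described in the captions of Figures 2 and 4 can be illustrated with a minimal sketch. It assumes only what the captions state: a pair is kept when the winner's CLAP score is at least a threshold (alpha) and the winner-minus-loser CLAP gap is at least a margin (delta). The `PreferencePair` type, the `filter_pairs` function, the example prompts, the scores, and the threshold values are all hypothetical and chosen for illustration; they are not the authors' code or settings, and the paper may apply further selection parameters not shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical record: one prompt with CLAP scores for its winner and loser audio."""
    prompt: str
    clap_winner: float  # CLAP similarity between the prompt and the winner audio x^w
    clap_loser: float   # CLAP similarity between the prompt and the loser audio x^l


def filter_pairs(pairs: List[PreferencePair], alpha: float, delta: float) -> List[PreferencePair]:
    """Keep pairs whose winner is well aligned (score >= alpha) and
    clearly separated from the loser (score gap >= delta)."""
    return [
        p for p in pairs
        if p.clap_winner >= alpha and (p.clap_winner - p.clap_loser) >= delta
    ]


# Illustrative data only; the scores and thresholds below are made up.
pairs = [
    PreferencePair("a dog barks then a car passes", 0.62, 0.31),
    PreferencePair("rain falls on a tin roof", 0.28, 0.10),        # winner too weakly aligned
    PreferencePair("a door slams followed by footsteps", 0.55, 0.52),  # winner and loser too close
]

kept = filter_pairs(pairs, alpha=0.45, delta=0.08)
for p in kept:
    print(p.prompt)
```

Raising alpha discards pairs whose winner is weakly aligned with the prompt, while raising delta discards pairs whose winner and loser are hard to tell apart. These are the two sources of noise that the Figure 2 caption identifies in the unfiltered dataset.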