BATON: Aligning Text-to-Audio Model with Human Preference Feedback

Huan Liao, Haonan Han, Kai Yang, Tianjiao Du, Rui Yang, Zunnan Xu, Qinmei Xu, Jingquan Liu, Jiasheng Lu, Xiu Li

TL;DR

BATON tackles the problem of aligning text-to-audio generation with human preferences by introducing a three-stage framework: (i) constructing a text-audio dataset with human annotations, (ii) training an audio reward model to mimic human alignment, and (iii) fine-tuning a diffusion-based TTA pipeline with reward-weighted likelihood while regularizing toward the pre-trained baseline. The approach leverages synthetic prompts generated via GPT-4, human judgments on 2-label integrity and 3-label temporal tasks, and a CLAP-based reward criterion to guide optimization. Empirical results show notable improvements in both objective (CLAP, FD, FAD, KL) and subjective (MOS-Q, MOS-F) measures, with clear gains in audio integrity and temporal ordering. The work also provides ablations demonstrating the value of combining human and reward-model data and discusses limitations and avenues for future online, reinforcement-learning–driven alignment with human feedback.
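
As a rough illustration of stage (iii), the sketch below combines a reward-weighted loss on generated samples with a regularization term computed on pre-training data. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `reward_weighted_loss`, the weight `beta`, and the use of plain per-sample loss values in place of a diffusion denoising loss are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def reward_weighted_loss(
    sample_losses: Sequence[float],
    rewards: Sequence[float],
    pretrain_losses: Sequence[float],
    beta: float = 0.5,  # hypothetical regularization weight, not taken from the paper
) -> float:
    """Reward-weighted loss plus a regularizer toward the pre-trained model.

    sample_losses:   per-sample training loss on preference-annotated audio
                     (a stand-in for the diffusion denoising loss).
    rewards:         reward-model (or human) scores for the same samples.
    pretrain_losses: per-sample loss on the original pre-training data,
                     used here as the regularization term.
    """
    if len(sample_losses) != len(rewards):
        raise ValueError("sample_losses and rewards must have the same length")
    if not sample_losses or not pretrain_losses:
        raise ValueError("inputs must be non-empty")

    # Samples with higher reward contribute more to the update.
    weighted = sum(r * l for r, l in zip(rewards, sample_losses)) / len(sample_losses)
    # Averaged pre-training loss keeps the model close to its baseline behavior.
    regularizer = sum(pretrain_losses) / len(pretrain_losses)
    return weighted + beta * regularizer


if __name__ == "__main__":
    loss = reward_weighted_loss([1.0, 2.0], [1.0, 0.0], [4.0], beta=0.5)
    print(loss)
```

With a reward of 0, a misaligned sample contributes nothing to the weighted term, so only samples judged to match their prompts drive the update, while the pre-training term limits drift from the baseline.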

Abstract

With the development of AI-Generated Content (AIGC), text-to-audio models are gaining widespread attention. However, it is challenging for these models to generate audio aligned with human preference due to the inherent information density of natural language and limited model understanding ability. To alleviate this issue, we propose BATON, a framework designed to enhance the alignment between generated audio and the text prompt using human preference feedback. BATON comprises three key stages. First, we curated a dataset containing both prompts and the corresponding generated audio, which was then annotated based on human feedback. Second, we introduced a reward model trained on the constructed dataset, which can mimic human preference by assigning rewards to input text-audio pairs. Finally, we employed the reward model to fine-tune an off-the-shelf text-to-audio model. The experimental results demonstrate that BATON can significantly improve the generation quality of the original text-to-audio model with respect to audio integrity, temporal relationship, and alignment with human preference.

Paper Structure (27 sections, 8 equations, 10 figures, 13 tables)

This paper contains 27 sections, 8 equations, 10 figures, 13 tables.

Figures (10)

  • Figure 1: Examples of two-label and three-label audio samples. Samples on the left are aligned with their prompts; samples on the right are misaligned.
  • Figure 2: The framework of BATON. BATON integrates three modules: (1) an audio generation unit using LLM-augmented prompts, with human-scored annotations; (2) a reward model trained on synthetic data to emulate human alignment preference; (3) a fine-tuning mechanism that enhances the original generative model using the reward model combined with human-labeled and pre-training datasets.
  • Figure 3: Comparison of generated samples from TANGO (original model) and BATON (fine-tuned model). The left two samples in the display are from the original model, while the right two are from the fine-tuned model. Comparisons (a) and (b) show that the fine-tuned model produces complete audio events, unlike the original model, which omits certain audio events. In comparisons (c) and (d), the original model generates audio with a confused sequence, whereas the fine-tuned model follows the sequence given in the prompt.
  • Figure 4: Prediction distribution of audio reward models.
  • Figure 5: Screenshot of the annotation system.
  • ...and 5 more figures