CosyAudio: Improving Audio Generation with Confidence Scores and Synthetic Captions

Xinfa Zhu, Wenjie Tian, Xinsheng Wang, Lei He, Xi Wang, Sheng Zhao, Lei Xie

TL;DR

CosyAudio tackles data-quality bottlenecks in text-to-audio (TTA) generation by introducing AudioCapTeller, a model that jointly captions audio and outputs a confidence score for each caption, which in turn enables a quality-aware audio generator. The framework combines well-labeled and weakly-labeled corpora through a four-stage self-evolving training strategy that includes direct preference optimization (DPO) for caption refinement. Experiments on AudioCaps, Clotho, and WavCaps show that CosyAudio outperforms state-of-the-art automated audio captioning (AAC) and TTA models, producing more faithful captions and audio, and that it generalizes across diverse scenarios; they also demonstrate robust confidence-score-based caption assessment and improved audio generation through synthetic captions and quality-aware conditioning. The work suggests a practical path to scalable TTA using synthetic captions and self-improvement, with potential impact on real-world content creation and multimedia AI systems.

Abstract

Text-to-Audio (TTA) generation is an emerging area within AI-generated content (AIGC), where audio is created from natural language descriptions. Despite growing interest, developing robust TTA models remains challenging due to the scarcity of well-labeled datasets and the prevalence of noisy or inaccurate captions in large-scale, weakly-labeled corpora. To address these challenges, we propose CosyAudio, a novel framework that utilizes confidence scores and synthetic captions to enhance the quality of audio generation. CosyAudio consists of two core components: AudioCapTeller and an audio generator. AudioCapTeller generates synthetic captions for audio and provides confidence scores to evaluate their accuracy. The audio generator uses these synthetic captions and confidence scores to enable quality-aware audio generation. Additionally, we introduce a self-evolving training strategy that iteratively optimizes CosyAudio across both well-labeled and weakly-labeled datasets. After initial training on well-labeled data, AudioCapTeller applies its assessment capabilities to weakly-labeled datasets for high-quality filtering and reinforcement learning, which further improves its performance. The well-trained AudioCapTeller then refines the corpora by generating new captions and confidence scores, which are used to train the audio generator. Extensive experiments on open-source datasets demonstrate that CosyAudio outperforms existing models in automated audio captioning, generates more faithful audio, and exhibits strong generalization across diverse scenarios.
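The filtering and preference-learning steps described above can be illustrated with a small sketch. This is not the authors' implementation: the `ScoredCaption` type, the function names, the assumption that confidence scores lie in [0, 1], and the threshold and margin values are all hypothetical, chosen only to show how confidence scores could drive both high-quality filtering and the construction of preferred/rejected caption pairs for DPO-style training.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ScoredCaption:
    """A caption paired with a confidence score (hypothetical container)."""
    audio_id: str
    caption: str
    confidence: float  # assumed to lie in [0, 1]


def filter_by_confidence(items: List[ScoredCaption],
                         threshold: float = 0.7) -> List[ScoredCaption]:
    """Keep only captions whose confidence meets the (assumed) threshold."""
    return [it for it in items if it.confidence >= threshold]


def build_preference_pairs(items: List[ScoredCaption],
                           min_margin: float = 0.2) -> List[Tuple[str, str, str]]:
    """Return (audio_id, preferred_caption, rejected_caption) triples.

    For each audio clip with at least two candidate captions, the
    highest-confidence caption is treated as preferred and the
    lowest-confidence one as rejected, provided the score gap is at
    least `min_margin`. This pairing rule is an illustrative assumption.
    """
    by_audio: Dict[str, List[ScoredCaption]] = {}
    for it in items:
        by_audio.setdefault(it.audio_id, []).append(it)

    pairs: List[Tuple[str, str, str]] = []
    for audio_id, group in by_audio.items():
        if len(group) < 2:
            continue  # a pair needs two candidates for the same clip
        best = max(group, key=lambda c: c.confidence)
        worst = min(group, key=lambda c: c.confidence)
        if best.confidence - worst.confidence >= min_margin:
            pairs.append((audio_id, best.caption, worst.caption))
    return pairs


if __name__ == "__main__":
    candidates = [
        ScoredCaption("a1", "a dog barks twice", 0.9),
        ScoredCaption("a1", "music plays", 0.3),
        ScoredCaption("a2", "rain falls on a roof", 0.6),
    ]
    print(filter_by_confidence(candidates))
    print(build_preference_pairs(candidates))
```

In a pipeline of this shape, the filtered captions would feed supervised training while the preference pairs would feed the preference-optimization stage; how CosyAudio actually selects thresholds and pairs is not stated in this summary.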

Paper Structure

This paper contains 21 sections, 4 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Overview of the proposed CosyAudio. We improve audio generation with confidence scores and synthetic captions, where the AudioCapTeller generates captions and confidence scores from audio and the audio generator synthesizes audio from input captions and confidence scores.
  • Figure 2: Model structure of AudioCapTeller. We use learnable queries to connect audio and text modalities, enabling the generation of captions and confidence scores from audio.
  • Figure 3: The process of self-evolving training. We use self-evolving training to iteratively optimize AudioCapTeller on well-labeled and weakly-labeled corpora through its caption generation and assessment capabilities.
  • Figure 5: Confidence score distributions of different evaluation models.
  • Figure : (a) Correlation visualization using CLAP scores
  • ...and 2 more figures