DiveSound: LLM-Assisted Automatic Taxonomy Construction for Diverse Audio Generation

Baihan Li, Zeyu Xie, Xuenan Xu, Yiwei Guo, Ming Yan, Ji Zhang, Kai Yu, Mengyue Wu

TL;DR

DiveSound tackles the lack of systematic diversity assessment in audio generation by introducing an automated, scalable pipeline that builds a multimodal dataset with a diversified taxonomy guided by LLMs. It combines GPT-4-driven subcategory clustering on VGGSound with an automatic text-audio-image matching pipeline using CLIP and CLAP to produce 35 classes with an average of 2.42 subcategories and over 10K clips. A latent-diffusion-based generator conditioned on fused text, image, and label embeddings demonstrates that visual guidance improves both diversity and quality (lower FAD and higher MSD/MOS) compared with baselines. The approach is scalable and reduces human labeling bias, offering practical benefits for controllable, diverse audio generation across applications.

Abstract

Audio generation has attracted significant attention. Despite remarkable improvements in audio quality, existing models overlook diversity evaluation. This is partly due to the lack of a systematic sound class diversity framework and a matching dataset. To address these issues, we propose DiveSound, a novel framework for constructing multimodal datasets with an in-class diversified taxonomy, assisted by large language models. As both textual and visual information can be used to guide diverse generation, DiveSound leverages multimodal contrastive representations in data construction. Our framework is highly autonomous and can be easily scaled up. We provide a text-audio-image aligned diversity dataset whose sound event class tags have an average of 2.42 subcategories. Text-to-audio experiments on the constructed dataset show a substantial increase in diversity when generation is guided by visual information.

Paper Structure

This paper contains 13 sections, 1 equation, 4 figures, 1 table.

Figures (4)

  • Figure 1: Multimodal text-audio-image dataset samples
  • Figure 2: The process of regrouping the VGGSound labels into new sound event labels, using animals as an example. Sound event labels for the other overarching categories are classified with the same form of prompt.
  • Figure 3: The prompt and examples of editing classifications and creating new subcategories. Three important rules are listed.
  • Figure 4: Top: the automated matching process of text-audio data pairs. The example here uses the new class dog to demonstrate how an audio clip is matched with its corresponding text data pair. Bottom: the statistics of the newly selected dataset, including the 35 class labels with an average of 2.42 subcategories.
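
The text-audio matching step in Figure 4 can be illustrated with a minimal sketch. It assumes that each audio clip and each candidate subcategory description has already been embedded into a shared space by a contrastive model such as CLAP. The function names, the toy vectors, the subcategory names, and the similarity threshold are all hypothetical and chosen for illustration; the paper's actual matching procedure may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_subcategory(audio_emb, text_embs, subcategories, threshold=0.3):
    """Assign an audio clip to the most similar subcategory description.

    Returns (label, score). The label is None when the best similarity falls
    below `threshold`, i.e. the clip is left unmatched. The threshold value
    is an assumption, not taken from the paper.
    """
    scores = [cosine(audio_emb, t) for t in text_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    if scores[best] < threshold:
        return None, scores[best]
    return subcategories[best], scores[best]


# Toy 3-dimensional embeddings standing in for real encoder outputs.
subcategories = ["large dog barking", "small dog yapping"]  # hypothetical names
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
audio_emb = [0.9, 0.1, 0.0]

label, score = match_subcategory(audio_emb, text_embs, subcategories)
print(label, round(score, 3))
```

In this sketch a clip whose embedding is not close to any subcategory description is dropped rather than forced into the nearest class, which is one plausible way to keep an automatically built dataset clean.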