Enhancing Zero-shot Audio Classification using Sound Attribute Knowledge from Large Language Models

Xuenan Xu, Pingyue Zhang, Ming Yan, Ji Zhang, Mengyue Wu

TL;DR

This work tackles zero-shot audio classification by enriching textual class descriptions with attribute-focused knowledge generated by a large language model. It introduces a general auditory attribute set and uses ChatGPT to produce per-class attribute descriptions, which are then aligned with audio via a contrastive learning framework that uses cosine similarity after FC projections and an InfoNCE-based loss. Empirical results on VGGSound and AudioSet show consistent, significant improvements over baselines across multiple backbones and text encoders, with ablations highlighting the value of high-level semantic attributes and the With-class sampling strategy. The approach demonstrates strong data efficiency and robustness, offering a practical path to more generalizable audio understanding. The authors note limitations in attribute-label alignment and leave the exploration of additional LLMs to future work.
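
The alignment step described above (FC projection, cosine similarity, InfoNCE) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the temperature value of 0.07, and the choice of a single audio-to-text loss direction are all assumptions made for illustration, and real embeddings would come from trained audio and text encoders.

```python
import math


def project(x, weight, bias):
    """Fully connected layer y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def cosine(u, v):
    """Cosine similarity between two vectors (small epsilon avoids division by zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)


def info_nce(audio_embs, text_embs, temperature=0.07):
    """Audio-to-text InfoNCE loss, averaged over the batch.

    audio_embs[i] and text_embs[i] are assumed to be a matched pair; every
    other text in the batch serves as a negative. The temperature is an
    assumed value, not one taken from the paper.
    """
    total = 0.0
    for i, audio in enumerate(audio_embs):
        logits = [cosine(audio, text) / temperature for text in text_embs]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]  # cross-entropy with the matched text as target
    return total / len(audio_embs)


def classify(audio_emb, class_text_embs):
    """Zero-shot prediction: index of the class description most similar to the audio."""
    sims = [cosine(audio_emb, text) for text in class_text_embs]
    return max(range(len(sims)), key=sims.__getitem__)
```

At inference time, `classify` would be called with the embeddings of the unseen classes' descriptions, so no audio from those classes is needed during training.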

Abstract

Zero-shot audio classification aims to recognize and classify a sound class that the model has never seen during training. This paper presents a novel approach for zero-shot audio classification using automatically generated sound attribute descriptions. We propose a list of sound attributes and leverage a large language model's domain knowledge to generate detailed attribute descriptions for each class. In contrast to previous works that primarily relied on class labels or simple descriptions, our method focuses on multi-dimensional innate auditory attributes, capturing different characteristics of sound classes. Additionally, we incorporate a contrastive learning approach to enhance zero-shot learning from textual labels. We validate the effectiveness of our method on VGGSound and AudioSet (the code is available at https://www.github.com/wsntxxn/AttrEnhZsAc). Our results demonstrate a substantial improvement in zero-shot classification accuracy. Ablation results show robust performance enhancement, regardless of the model architecture.

Paper Structure

This paper contains 12 sections, 7 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: The contrastive learning framework for zero-shot audio classification. The text is the description of sound attributes for the class generated by ChatGPT. During training, attributes are randomly selected to form the description while all attributes are used during inference.
  • Figure 2: VGGSound zero-shot classification performance with each attribute included during training.
  • Figure 3: Performance enhancement brought by adding attributes. Green and red denote ground-truth and misclassified classes, respectively. The corresponding attribute description is also presented. Ono. = onomatopoeia, simi. = simile, emo. = emotion.
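
The Figure 1 caption says that attributes are randomly selected to form the description during training, while all attributes are used during inference. A minimal sketch of that behaviour follows. The function name, the uniform choice of subset size, the sentence-joining format, and the example attribute texts are all hypothetical placeholders; they are not ChatGPT outputs or settings from the paper.

```python
import random


def build_description(attributes, training, rng=random):
    """Join per-attribute texts into one class description.

    attributes: dict mapping attribute name -> description text.
    training:   if True, keep a random non-empty subset of attributes
                (subset size drawn uniformly, an assumption of this sketch);
                if False, keep every attribute.
    """
    names = list(attributes)
    if training:
        k = rng.randint(1, len(names))
        chosen = set(rng.sample(names, k))
        names = [n for n in names if n in chosen]  # preserve the original order
    return " ".join(attributes[n] for n in names)


# Invented example texts for illustration only.
dog_bark = {
    "onomatopoeia": "It sounds like 'woof woof'.",
    "simile": "It resembles a short, sharp cough.",
    "emotion": "It can convey alertness or excitement.",
}

train_text = build_description(dog_bark, training=True, rng=random.Random(0))
test_text = build_description(dog_bark, training=False)
```

Here `train_text` varies with the random state and contains between one and three of the sentences, while `test_text` always contains all three in their original order.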