Enhancing Zero-shot Audio Classification using Sound Attribute Knowledge from Large Language Models
Xuenan Xu, Pingyue Zhang, Ming Yan, Ji Zhang, Mengyue Wu
TL;DR
This work tackles zero-shot audio classification by enriching textual class descriptions with attribute-focused knowledge generated by a large language model. It introduces a general auditory attribute set and uses ChatGPT to produce per-class attribute descriptions, which are then aligned with audio through a contrastive learning framework: audio and text embeddings are passed through fully connected (FC) projections, compared by cosine similarity, and trained with an InfoNCE-based loss. Empirical results on VGGSound and AudioSet show consistent, significant improvements over baselines across multiple backbones and text encoders, with ablations highlighting the value of high-level semantic attributes and of the With-class sampling strategy. The approach demonstrates strong data efficiency and robustness, offering a practical path to more generalizable audio understanding. The authors note limitations in attribute-label alignment and leave the exploration of additional LLMs to future work.
Abstract
Zero-shot audio classification aims to recognize and classify a sound class that the model has never seen during training. This paper presents a novel approach for zero-shot audio classification using automatically generated sound attribute descriptions. We propose a list of sound attributes and leverage the domain knowledge of a large language model to generate detailed attribute descriptions for each class. In contrast to previous work that primarily relies on class labels or simple descriptions, our method focuses on multi-dimensional innate auditory attributes, capturing different characteristics of sound classes. Additionally, we incorporate a contrastive learning approach to enhance zero-shot learning from textual labels. We validate the effectiveness of our method on VGGSound and AudioSet. Our results demonstrate a substantial improvement in zero-shot classification accuracy. Ablation results show robust performance enhancement, regardless of the model architecture. The code is available at https://www.github.com/wsntxxn/AttrEnhZsAc.
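
The contrastive alignment and zero-shot prediction summarized above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the temperature of 0.07, the audio-to-text-only loss direction, and the toy two-dimensional embeddings are all assumptions made for illustration, and the embeddings are assumed to have already passed through the projection layers.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(audio_embs, text_embs, temperature=0.07):
    """Audio-to-text InfoNCE loss (illustrative form).

    The i-th audio embedding is treated as the positive pair of the i-th
    text embedding; all other texts in the batch act as negatives.
    The temperature value is an assumption, not taken from the paper.
    """
    losses = []
    for i, audio in enumerate(audio_embs):
        logits = [cosine(audio, text) / temperature for text in text_embs]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


def zero_shot_predict(audio_emb, class_text_embs):
    """Return the class whose description embedding is closest to the audio.

    `class_text_embs` maps a class name to the embedding of its
    (attribute-enriched) textual description.
    """
    return max(class_text_embs,
               key=lambda name: cosine(audio_emb, class_text_embs[name]))
```

As a usage example with hypothetical class names, `zero_shot_predict([0.9, 0.2], {"dog barking": [1.0, 0.0], "rain": [0.0, 1.0]})` selects `"dog barking"`, because that description embedding has the higher cosine similarity with the audio embedding. During training, correctly paired batches yield a lower `info_nce` value than mismatched ones, which is what pulls audio clips toward the descriptions of their own classes.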
