Boosting Audio-visual Zero-shot Learning with Large Language Models

Haoxing Chen, Yaohui Li, Yan Hong, Zizheng Huang, Zhuoer Xu, Zhangxuan Gu, Jun Lan, Huijia Zhu, Weiqiang Wang

TL;DR

The paper tackles audio-visual zero-shot learning (AVZSL) by using large language models to generate rich, discriminative descriptions of event concepts and by aligning audio-visual features with the resulting knowledge representations in a shared space. It introduces a knowledge-aware adaptive margin loss, which strengthens inter-class separability according to knowledge similarities, and an alignment loss, which enforces intra-class compactness. Results on three AVZSL benchmarks show state-of-the-art performance across the main and classification feature settings, and ablations confirm the contribution of both the LLM-generated descriptions and the proposed losses. The framework is a simple yet effective way to leverage external knowledge for zero-shot generalization in multimodal video understanding, with code available at the project repository.

Abstract

Audio-visual zero-shot learning aims to recognize unseen classes based on paired audio-visual sequences. Recent methods mainly focus on learning multi-modal features aligned with class names to enhance the generalization ability to unseen categories. However, these approaches ignore the obscure event concepts in class names and may inevitably introduce complex network structures with difficult training objectives. In this paper, we introduce a straightforward yet efficient framework called KnowleDge-Augmented audio-visual learning (KDA), which helps the model learn novel event content more effectively by leveraging an external knowledge base. Specifically, we first propose to utilize the knowledge contained in large language models (LLMs) to generate numerous descriptive sentences that include important distinguishing audio-visual features of event classes, which helps the model better understand unseen categories. Furthermore, we propose a knowledge-aware adaptive margin loss to help distinguish similar events, further improving the generalization ability towards unseen classes. Extensive experimental results demonstrate that our proposed KDA can outperform state-of-the-art methods on three popular audio-visual zero-shot learning datasets. Our code will be available at https://github.com/chenhaoxing/KDA.
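
To make the idea of classifying against knowledge representations concrete, here is a minimal sketch; it is not the authors' implementation. It assumes that each class has several description embeddings (toy 3-dimensional vectors stand in for text-encoder outputs), that a class's knowledge representation is the mean of its description embeddings, and that prediction picks the class with the highest cosine similarity to the audio-visual feature. The class names and the helpers `mean_embedding`, `cosine`, and `predict` are hypothetical.

```python
import math

def mean_embedding(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def predict(av_feature, class_knowledge):
    """Return the class whose knowledge embedding is most similar to av_feature."""
    return max(class_knowledge,
               key=lambda name: cosine(av_feature, class_knowledge[name]))

# Toy description embeddings (assumed values, two descriptions per class).
descriptions = {
    "playing_guitar": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "dog_barking": [[0.0, 0.9, 0.2], [0.1, 0.8, 0.1]],
}

# One knowledge representation per class: the mean of its description embeddings.
knowledge = {name: mean_embedding(vecs) for name, vecs in descriptions.items()}

# A toy audio-visual feature, assumed to live in the same space.
av_feature = [0.7, 0.3, 0.0]
predicted = predict(av_feature, knowledge)
print(predicted)
```

Because an unseen class needs only its descriptions to obtain a knowledge embedding, the same `predict` call works for classes that never appeared during training, which is the property zero-shot classification relies on.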

Paper Structure (15 sections, 6 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Inspired by the fact that detailed descriptions can help people understand novel concepts and distinguish similar event contents, we propose to improve model generalization ability based on external knowledge.
  • Figure 2: Audio-visual zero-shot classification results on three benchmark datasets. Textual descriptions with richer knowledge improve the generalization ability of models.
  • Figure 3: Overview of our proposed KnowleDge-Augmented audio-visual learning (KDA). KDA takes the audio and visual features extracted from the video data as input and obtains multi-modal audio-visual features $\rho_{av}$ for classification through the cross-attention module and embedding layer. The knowledge description is obtained by having LLMs interpret the event name, and the knowledge representation $t$ is then obtained with the CLIP text encoder. We use the alignment loss $\mathcal{L}_{align}$ to promote intra-class compactness and the knowledge-aware adaptive margin loss $\mathcal{L}_{kaml}$ to enhance inter-class separability.
  • Figure 4: t-SNE visualisation for five seen and two unseen test classes from the UCF-GZSL dataset, showing audio and visual input embeddings extracted with SeLaVi (Asano et al., 2020), and learned audio-visual embeddings in the common space. Knowledge embeddings are visualised with a square. KDA facilitates pulling together features from the same parent class while pushing away features belonging to different parent classes.
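
The interplay of the two losses named in Figure 3 can be illustrated with a small sketch. This is an illustration under stated assumptions, not the paper's formulation: the exact definitions of $\mathcal{L}_{align}$ and $\mathcal{L}_{kaml}$ are not reproduced in this summary. Here the alignment term is assumed to be the squared Euclidean distance between an audio-visual embedding and its class knowledge embedding, and the margin term is assumed to be a hinge loss whose margin grows with the cosine similarity between the true class's knowledge embedding and each other class's. The parameters `base_margin` and `scale`, the toy class names, and all function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def alignment_loss(av, t_true):
    """Assumed alignment term: pull the embedding towards its class knowledge."""
    return sq_dist(av, t_true)

def adaptive_margin_loss(av, label, knowledge, base_margin=0.1, scale=0.5):
    """Assumed hinge loss with a per-class margin driven by knowledge similarity."""
    t_true = knowledge[label]
    d_pos = sq_dist(av, t_true)
    total = 0.0
    for name, t_other in knowledge.items():
        if name == label:
            continue
        # Classes with similar knowledge embeddings receive a larger margin.
        margin = base_margin + scale * max(0.0, cosine(t_true, t_other))
        total += max(0.0, margin + d_pos - sq_dist(av, t_other))
    return total

# Toy 2-D knowledge embeddings: class_b is close to class_a, class_c is orthogonal.
knowledge = {
    "class_a": [1.0, 0.0],
    "class_b": [0.8, 0.6],
    "class_c": [0.0, 1.0],
}
av = [0.9, 0.1]  # toy audio-visual embedding of a class_a sample

align = alignment_loss(av, knowledge["class_a"])
adaptive = adaptive_margin_loss(av, "class_a", knowledge)
fixed = adaptive_margin_loss(av, "class_a", knowledge, scale=0.0)
print(align, adaptive, fixed)
```

In this toy setup the fixed-margin variant (`scale=0.0`) is already satisfied, while the adaptive variant still penalizes the sample because `class_b` has a knowledge embedding similar to `class_a`. That is the intuition behind a similarity-dependent margin: similar events are asked to stay further apart than dissimilar ones.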