Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models

David Kurzendörfer, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata

TL;DR

This work tackles audio-visual generalized zero-shot learning by leveraging pre-trained multimodal models CLIP (visual) and CLAP (audio) to extract features and dual textual label embeddings. The authors propose a lightweight feed-forward architecture that maps audio-visual inputs and concatenated CLIP/CLAP text embeddings into a shared embedding space, trained with a composite loss to align modalities and label semantics. They demonstrate state-of-the-art harmonic mean performance on $\text{VGGSound-GZSL}^{cls}$, $\text{UCF-GZSL}^{cls}$, and $\text{ActivityNet-GZSL}^{cls}$, with notable ZSL gains and robust ablations confirming the benefits of using both modalities and both text embeddings. The results underscore the practical potential of large multimodal models for generalized zero-shot video understanding, while also acknowledging limitations related to potential data leakage from pretraining sources and reliance on fixed feature extractors.
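
For readers unfamiliar with the metric named above: in generalized zero-shot learning the harmonic mean is conventionally computed from the accuracy on seen classes and the accuracy on unseen classes, so a model cannot score well by favoring only one group. The sketch below illustrates that standard formula. The function name `harmonic_mean` and the example accuracies are illustrative assumptions, not code or results from the paper.

```python
def harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """Harmonic mean of seen-class and unseen-class accuracy.

    Standard GZSL summary metric: HM = 2 * S * U / (S + U).
    Returns 0.0 when both accuracies are zero to avoid dividing by zero.
    """
    total = seen_acc + unseen_acc
    if total == 0:
        return 0.0
    return 2.0 * seen_acc * unseen_acc / total


# Made-up accuracies, for illustration only (not numbers from the paper).
balanced = harmonic_mean(0.5, 0.5)   # both groups equally accurate
lopsided = harmonic_mean(0.9, 0.1)   # strong on seen, weak on unseen
```

With these made-up inputs the lopsided model scores lower than the balanced one even though both have the same arithmetic mean, which is why the metric is preferred for comparing GZSL methods.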

Abstract

Audio-visual zero-shot learning methods commonly build on features extracted from pre-trained models, e.g. video or audio classification models. However, existing benchmarks predate the popularization of large multi-modal models, such as CLIP and CLAP. In this work, we explore such large pre-trained models to obtain features, i.e. CLIP for visual features, and CLAP for audio features. Furthermore, the CLIP and CLAP text encoders provide class label embeddings which are combined to boost the performance of the system. We propose a simple yet effective model that only relies on feed-forward neural networks, exploiting the strong generalization capabilities of the new audio, visual and textual features. Our framework achieves state-of-the-art performance on VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL with our new features. Code and data available at: https://github.com/dkurzend/ClipClap-GZSL.

Paper Structure

This paper contains 12 sections, 9 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Our framework for audio-visual GZSL maps the audio and visual data to embeddings that are aligned with class label embeddings that are obtained from merging CLIP and CLAP embeddings. The class label embedding that is closest to the audio-visual embedding determines the class prediction. At test time, the set of class label embeddings contains both seen and unseen classes.
  • Figure 2: The image and audio encoders of CLIP and CLAP are used to extract features from the raw input which are concatenated and passed through multiple feed-forward networks to get an audio-visual output embedding $\theta_o$. Likewise, the text encoders of CLIP and CLAP are used to extract textual label embeddings. They are passed through a series of neural networks to obtain a learned class label embedding $\theta_w$. Both $\theta_o$ and $\theta_w$ reside in a joint embedding space.
  • Figure 3: t-SNE visualizations for the audio input features (left), visual input features (center), and the learned output embeddings for our model (right) for the $\text{ActivityNet-GZSL}^{cls}$ (top), $\text{UCF-GZSL}^{cls}$ (center) and $\text{VGGSound-GZSL}^{cls}$ (bottom) datasets for two unseen classes and four seen classes. The learned class text embeddings are visualized as diamonds.
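
The data flow described in Figures 1 and 2 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature sizes, the two-layer networks with ReLU, the random untrained weights, and the use of Euclidean distance for "closest" are all hypothetical choices, and the names `make_mlp`, `mlp`, `av_net`, `text_net`, and `predict` are invented for this example. The pre-extracted CLIP and CLAP features are stood in for by plain arrays.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
D_VIS, D_AUD = 512, 1024   # assumed CLIP / CLAP embedding sizes
D_HID, D_OUT = 256, 64     # assumed hidden and joint-space sizes


def make_mlp(d_in, d_hidden, d_out):
    """Random (untrained) weights for a two-layer feed-forward network."""
    return {
        "W1": rng.normal(0.0, 0.02, (d_in, d_hidden)),
        "b1": np.zeros(d_hidden),
        "W2": rng.normal(0.0, 0.02, (d_hidden, d_out)),
        "b2": np.zeros(d_out),
    }


def mlp(params, x):
    """Linear -> ReLU -> Linear."""
    hidden = np.maximum(x @ params["W1"] + params["b1"], 0.0)
    return hidden @ params["W2"] + params["b2"]


# One network for the audio-visual branch, one for the class-label branch.
av_net = make_mlp(D_VIS + D_AUD, D_HID, D_OUT)
text_net = make_mlp(D_VIS + D_AUD, D_HID, D_OUT)


def predict(visual_feat, audio_feat, clip_text, clap_text):
    """Return the index of the closest class label embedding per sample.

    visual_feat: (B, D_VIS) features from the CLIP image encoder
    audio_feat:  (B, D_AUD) features from the CLAP audio encoder
    clip_text:   (C, D_VIS) CLIP text embeddings of the C class labels
    clap_text:   (C, D_AUD) CLAP text embeddings of the C class labels
    """
    # Audio-visual output embedding theta_o, shape (B, D_OUT).
    theta_o = mlp(av_net, np.concatenate([visual_feat, audio_feat], axis=-1))
    # Learned class label embedding theta_w, shape (C, D_OUT).
    theta_w = mlp(text_net, np.concatenate([clip_text, clap_text], axis=-1))
    # Pairwise Euclidean distances, shape (B, C); the nearest class wins.
    dists = np.linalg.norm(theta_o[:, None, :] - theta_w[None, :, :], axis=-1)
    return dists.argmin(axis=1), theta_o, theta_w
```

At test time the rows of `clip_text` and `clap_text` would cover both seen and unseen classes, so the same nearest-embedding rule handles the generalized setting without any change to the networks.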