Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models
David Kurzendörfer, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata
TL;DR
This work tackles audio-visual generalized zero-shot learning (GZSL) by using two pre-trained multi-modal models, CLIP for visual features and CLAP for audio features, and by taking class label embeddings from the text encoders of both models. The authors propose a lightweight feed-forward architecture that maps the audio-visual input and the concatenated CLIP/CLAP text embeddings into a shared embedding space, trained with a composite loss that aligns the modalities with the label semantics. They report state-of-the-art harmonic mean performance on VGGSound-GZSL_cls, UCF-GZSL_cls, and ActivityNet-GZSL_cls, along with gains in the ZSL setting, and ablations indicate that using both modalities and both text embeddings helps. The results point to the practical potential of large multi-modal models for generalized zero-shot video understanding. Acknowledged limitations are possible data leakage from the pre-training sources and the reliance on fixed feature extractors.
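The harmonic mean mentioned above is, under the standard GZSL evaluation convention, computed from the performance on seen and unseen classes as HM = 2SU / (S + U). The following is a minimal sketch of that formula; the function name and the input values are illustrative assumptions, not code or results from the paper.

```python
def harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """Harmonic mean of seen-class and unseen-class accuracy (standard GZSL metric)."""
    total = seen_acc + unseen_acc
    if total == 0.0:
        # Both accuracies are zero; define the harmonic mean as zero.
        return 0.0
    return 2.0 * seen_acc * unseen_acc / total


# Illustrative values only, not numbers reported in the paper.
balanced = harmonic_mean(0.5, 0.5)
skewed = harmonic_mean(0.8, 0.2)
print(balanced, skewed)
```

Because the harmonic mean is dominated by the smaller of the two values, a model that does well on seen classes but poorly on unseen classes scores low, which is why it is the usual headline metric for GZSL.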
Abstract
Audio-visual zero-shot learning methods commonly build on features extracted from pre-trained models, e.g., video or audio classification models. However, existing benchmarks predate the popularization of large multi-modal models such as CLIP and CLAP. In this work, we explore such large pre-trained models to obtain features, i.e., CLIP for visual features and CLAP for audio features. Furthermore, the CLIP and CLAP text encoders provide class label embeddings, which are combined to boost the performance of the system. We propose a simple yet effective model that relies only on feed-forward neural networks, exploiting the strong generalization capabilities of the new audio, visual, and textual features. Our framework achieves state-of-the-art performance on VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL with our new features. Code and data are available at: https://github.com/dkurzend/ClipClap-GZSL.
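To make the idea concrete, here is a rough sketch of how a feed-forward model of this kind could classify a clip: the audio and visual features are concatenated and projected, the CLIP and CLAP text embeddings of each class label are concatenated and projected, and the closest class in the shared space is chosen. Everything here is an assumption for illustration: the layer sizes, the untrained random weights, the cosine similarity, the class names, and all function names are hypothetical, and the actual architecture and loss are in the linked repository.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def mlp(x, layers):
    """Feed-forward network with ReLU between layers (none after the last)."""
    for i, (weights, bias) in enumerate(layers):
        x = linear(x, weights, bias)
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(audio_feat, visual_feat, class_text, input_net, text_net):
    """Pick the class whose projected text embedding is closest to the clip."""
    query = mlp(audio_feat + visual_feat, input_net)  # list concatenation
    scores = {}
    for label, (clip_text, clap_text) in class_text.items():
        class_embedding = mlp(clip_text + clap_text, text_net)
        scores[label] = cosine(query, class_embedding)
    return max(scores, key=scores.get), scores


def build_toy_example(seed=0, feat_dim=4, text_dim=3, hidden=8, shared=5):
    """Random stand-ins for CLAP/CLIP features and for trained weights."""
    rng = random.Random(seed)

    def vec(n):
        return [rng.uniform(-1.0, 1.0) for _ in range(n)]

    def layer(n_in, n_out):
        return [vec(n_in) for _ in range(n_out)], [0.0] * n_out

    input_net = [layer(2 * feat_dim, hidden), layer(hidden, shared)]
    text_net = [layer(2 * text_dim, hidden), layer(hidden, shared)]
    # Hypothetical class names; each maps to (CLIP text emb, CLAP text emb).
    class_text = {}
    for name in ("dog barking", "playing guitar"):
        class_text[name] = (vec(text_dim), vec(text_dim))
    return vec(feat_dim), vec(feat_dim), class_text, input_net, text_net


audio, visual, class_text, input_net, text_net = build_toy_example()
label, scores = classify(audio, visual, class_text, input_net, text_net)
print(label, scores)
```

Since the weights are random, the predicted label is meaningless; the sketch only shows the data flow. In a GZSL setting the `class_text` dictionary would contain both seen and unseen classes at test time, which is possible because unseen classes only require their label text, not training clips.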
