Audio-visual Generalized Zero-shot Learning the Easy Way

Shentong Mo, Pedro Morgado

TL;DR

This paper tackles audio-visual generalized zero-shot learning (AVGZSL), where the goal is to recognize unseen audio-visual categories at test time using semantic information from seen classes. It introduces EZ-AVGZL, a simple framework that forgoes reconstructive cross-modal decoders in favor of a single supervised text–audio–visual contrastive objective ($\mathcal{L}_{AVLA}$) that aligns audio-visual embeddings with transformed text representations. The approach combines Class Embedding Optimization (CEO), which produces well-separated yet semantically consistent class embeddings, with Audio-Visual Language Alignment (AVLA), which uses cross-attention to compute non-linear similarity scores. On VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL, EZ-AVGZL achieves state-of-the-art performance, with robust gains across seen and unseen splits and across diverse feature encoders, validating both effectiveness and generalization.

Abstract

Audio-visual generalized zero-shot learning is a rapidly advancing domain that seeks to understand the intricate relations between audio and visual cues within videos. The overarching goal is to leverage insights from seen classes to identify instances from previously unseen ones. Prior approaches primarily utilized synchronized auto-encoders to reconstruct audio-visual attributes, which were informed by cross-attention transformers and projected text embeddings. However, these methods fell short of effectively capturing the intricate relationship between cross-modal features and class-label embeddings inherent in pre-trained language-aligned embeddings. To circumvent these bottlenecks, we introduce a simple yet effective framework for Easy Audio-Visual Generalized Zero-shot Learning, named EZ-AVGZL, that aligns audio-visual embeddings with transformed text representations. It utilizes a single supervised text audio-visual contrastive loss to learn an alignment between audio-visual and textual modalities, moving away from the conventional approach of reconstructing cross-modal features and text embeddings. Our key insight is that while class name embeddings are well aligned with language-based audio-visual features, they don't provide sufficient class separation to be useful for zero-shot learning. To address this, our method leverages differential optimization to transform class embeddings into a more discriminative space while preserving the semantic structure of language representations. We conduct extensive experiments on VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL benchmarks. Our results demonstrate that our EZ-AVGZL achieves state-of-the-art performance in audio-visual generalized zero-shot learning.

Paper Structure (11 sections, 5 equations, 3 figures, 10 tables)

This paper contains 11 sections, 5 equations, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Comparison of our EZ-AVGZL with state-of-the-art methods on the VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL benchmarks in terms of Harmonic Mean (higher is better) over seen and unseen classes. Our method significantly outperforms previous baselines on all datasets.
  • Figure 2: Illustration of the proposed Easy Audio-Visual Generalized Zero-shot Learning (EZ-AVGZL) framework. The initial class embeddings ${\bm{t}}_i$ from a frozen text encoder are optimized for maximal separability while preserving semantics, yielding the new embeddings ${\bm{w}}_i$. Then the cross-attention transformer $f_\theta^{av}(\cdot,\cdot)$ takes visual and audio features $({\bm{v}}_i, {\bm{a}}_i)$ from the unimodal encoders and generates the multi-modal representations ${\bm{x}}^{av}_i$. Finally, a non-linear similarity function aligns the representations ${\bm{x}}^{av}_i$ with the corresponding class embeddings ${\bm{w}}_{y_i}$ by minimizing the cross-entropy loss between the predicted similarity score and the target score. The target is 1 in the target class entry and 0 for multi-modal representations ${\bm{x}}^{av}_j$ given visual features ${\bm{v}}_j$ of videos from other classes.
  • Figure 3: Confusion matrices of zero-shot predictions on VGGSound-GZSL using the model with and without class embedding optimization.
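
The alignment step described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine`, `alignment_loss`, `predict`) and the `temperature` parameter are hypothetical, and a temperature-scaled cosine similarity stands in for the paper's non-linear similarity function, whose exact form is not given here. The sketch also assumes a softmax cross-entropy with a one-hot target (1 for the target class entry, 0 elsewhere).

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_loss(x_av, class_embs, target, temperature=0.1):
    # Score the audio-visual representation x_av against every class
    # embedding w_k (placeholder similarity: scaled cosine, an assumption).
    scores = [cosine(x_av, w) / temperature for w in class_embs]
    # Softmax cross-entropy with a one-hot target, computed stably
    # with the log-sum-exp trick.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target]

def predict(x_av, class_embs):
    # Zero-shot style prediction: pick the class whose embedding
    # is most similar to the audio-visual representation.
    scores = [cosine(x_av, w) for w in class_embs]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy example: two class embeddings w_0, w_1 and one sample near w_0.
class_embs = [[1.0, 0.0], [0.0, 1.0]]
x_av = [0.9, 0.1]
loss_correct = alignment_loss(x_av, class_embs, target=0)
loss_wrong = alignment_loss(x_av, class_embs, target=1)
```

In this toy setup the loss is small when the sample is labeled with the class whose embedding it already points toward, and large otherwise, which is the behavior the alignment objective relies on. Because prediction only compares against a list of class embeddings, embeddings of unseen classes can be appended to `class_embs` at test time without retraining.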