Extremely Simple Out-of-distribution Detection for Audio-visual Generalized Zero-shot Learning

Yang Liu, Xun Zhang, Jiale Du, Xinbo Gao, Jungong Han

TL;DR

The paper addresses AV-GZSL bias by proposing EZ-AVOOD, an extremely simple OOD-detection framework that leverages class-specific logits and a class-agnostic residual subspace to separate seen from unseen samples without training a dedicated OOD detector. It uses a shared seen-classifier as the OOD detector and an adaptable unseen classifier (based on ClipClap) to realize robust GZSL performance, with the unseen pathway replaceable by other AV-GZSL methods. Experimental results on three audio-visual benchmarks show state-of-the-art harmonic mean and competitive ZSL scores, demonstrating strong cross-dataset effectiveness and practical compatibility. The method offers a lightweight yet powerful approach to mitigating domain shift in AV-GZSL, with clear guidance on hyperparameters and resilience to subspace dimension choices.

Abstract

Zero-shot Learning (ZSL) transfers knowledge from seen classes to unseen classes by exploiting auxiliary category information, a promising yet difficult research topic. Within this field, Audio-Visual Generalized Zero-Shot Learning (AV-GZSL) has attracted considerable interest: the intricate relations among its three modalities (audio, video, and natural language) make the task challenging but well worth studying. However, existing embedding-based and generative AV-GZSL methods both suffer heavily from the domain shift problem. We propose an extremely simple Out-of-distribution (OOD) detection based AV-GZSL method (EZ-AVOOD) that further mitigates this bias by separating seen from unseen samples at the very start of the pipeline. EZ-AVOOD achieves effective seen-unseen separation by exploiting the intrinsic discriminative information held in class-specific logits and a class-agnostic feature subspace, without training an extra OOD detector network. After this seen-unseen binary classification, two expert models classify seen samples and unseen samples separately. Compared with existing state-of-the-art methods, our model achieves superior ZSL and GZSL performance on three audio-visual datasets, setting a new state of the art and demonstrating the effectiveness of the proposed EZ-AVOOD.

Paper Structure

This paper contains 26 sections, 16 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Harmonic mean (%) of GZSL performance for our EZ-AVOOD model and the comparison methods on three datasets. EZ-AVOOD (the red bar) consistently outperforms the other methods, leading by up to 5% on the UCF-GZSL benchmark.
  • Figure 2: The general framework of EZ-AVOOD. Four key modules ("Feature Extractor", "OOD Detector", "Seen Classifier", and "Unseen Classifier") make up the complete model. The parameter-fixed feature extractor produces audio-visual features $\boldsymbol{a} \oplus \boldsymbol{v}$ ($\oplus$ denotes concatenation) and text embeddings $\boldsymbol{t}$ without further optimization. The seen classifier and the OOD detector are implemented as two identical MLPs that share one copy of parameters, so only one of them needs to be trained for both modules to work. The OOD score formulation is illustrated in Figure 3. At evaluation, the OOD detector separates seen from unseen samples and feeds them to the trained seen-expert and unseen-expert classifiers respectively (red arrows).
  • Figure 3: The EZ-OOD score formulation. During training, the "Residual Subspace" is derived from the eigen-decomposition of the feature matrix of all seen samples. At test time (pink arrows), the concatenated audio-visual feature $\boldsymbol{a} \oplus \boldsymbol{v}$ is projected onto the residual subspace to obtain the "Residual Score", and the "Energy Score" is computed from the test sample's logits produced by the MLP (in fact, the trained seen classifier). The final OOD score is the weighted sum of the energy score and the residual score.
  • Figure 4: ROC curves of EZ-OOD, Energy Score, and Residual Score on three datasets. The full EZ-OOD consistently outperforms both Energy Score and Residual Score, achieving a higher AUROC.
  • Figure 5: Effect of the scaling factor $\gamma$ on AUROC for three datasets. The OOD detection performance of EZ-OOD peaks when a suitable $\gamma$ balances the energy score and the residual score in their linear combination.
  • ...and 1 more figures
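
The score formulation described in the Figure 2 and Figure 3 captions can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the mean-centering of features, the sign convention (a higher score means "more likely unseen"), the placement of $\gamma$ on the residual term, and the fixed decision threshold are all hypothetical choices that the captions do not specify.

```python
import numpy as np


def fit_residual_subspace(seen_feats, principal_dim):
    """Eigen-decompose the seen-feature covariance and keep the residual basis.

    Assumption: features are mean-centered and the residual subspace is spanned
    by the eigenvectors left over after dropping the top `principal_dim` ones.
    """
    mean = seen_feats.mean(axis=0)
    centered = seen_feats - mean
    cov = centered.T @ centered / len(seen_feats)
    _, eigvecs = np.linalg.eigh(cov)  # eigenvalues come back in ascending order
    residual_basis = eigvecs[:, : eigvecs.shape[1] - principal_dim]
    return mean, residual_basis


def energy_score(logits):
    """Negative log-sum-exp of the seen-classifier logits (numerically stable)."""
    peak = logits.max(axis=-1, keepdims=True)
    log_sum_exp = peak.squeeze(-1) + np.log(np.exp(logits - peak).sum(axis=-1))
    return -log_sum_exp


def residual_score(feats, mean, residual_basis):
    """Norm of the feature's projection onto the residual subspace."""
    return np.linalg.norm((feats - mean) @ residual_basis, axis=-1)


def ood_score(feats, logits, mean, residual_basis, gamma):
    """Weighted sum of energy and residual scores (assumed weighting)."""
    return energy_score(logits) + gamma * residual_score(feats, mean, residual_basis)


def route_to_unseen(scores, threshold):
    """True where a sample would be sent to the unseen expert (assumed rule)."""
    return scores > threshold
```

In this sketch `feats` stands for the concatenated audio-visual features $\boldsymbol{a} \oplus \boldsymbol{v}$ and `logits` for the outputs of the trained seen classifier, so no additional detector network is involved. Samples flagged by `route_to_unseen` would go to the unseen expert and the rest to the seen expert.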