Extremely Simple Out-of-distribution Detection for Audio-visual Generalized Zero-shot Learning
Yang Liu, Xun Zhang, Jiale Du, Xinbo Gao, Jungong Han
TL;DR
The paper addresses AV-GZSL bias by proposing EZ-AVOOD, an extremely simple OOD-detection framework that leverages class-specific logits and a class-agnostic residual subspace to separate seen from unseen samples without training a dedicated OOD detector. It uses a shared seen-classifier as the OOD detector and an adaptable unseen classifier (based on ClipClap) to realize robust GZSL performance, with the unseen pathway replaceable by other AV-GZSL methods. Experimental results on three audio-visual benchmarks show state-of-the-art harmonic mean and competitive ZSL scores, demonstrating strong cross-dataset effectiveness and practical compatibility. The method offers a lightweight yet powerful approach to mitigating domain shift in AV-GZSL, with clear guidance on hyperparameters and resilience to subspace dimension choices.
Abstract
Zero-Shot Learning (ZSL) transfers knowledge from seen classes to unseen classes by exploiting auxiliary category information, and it remains a promising yet difficult research topic. Within this field, Audio-Visual Generalized Zero-Shot Learning (AV-GZSL) has attracted considerable interest: the intricate relations among its three modalities (audio, video, and natural language) make the task challenging but well worth studying. However, existing embedding-based and generative AV-GZSL methods both tend to suffer heavily from the domain shift problem. We therefore propose EZ-AVOOD, an extremely simple Out-of-Distribution (OOD) detection-based AV-GZSL method that further mitigates this bias by separating seen from unseen samples at the very outset. EZ-AVOOD achieves effective seen-unseen separation by exploiting the intrinsic discriminative information held in class-specific logits and a class-agnostic feature subspace, without training an extra OOD detector network. After this seen-unseen binary classification, two expert models classify the seen samples and the unseen samples separately. Compared with existing state-of-the-art methods, our model achieves superior ZSL and GZSL performance on three audio-visual datasets and sets a new state of the art, demonstrating the effectiveness of the proposed EZ-AVOOD.
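The seen-unseen separation described above can be illustrated with a minimal sketch. The abstract does not give the exact scoring rule, so everything below is an assumption made for illustration: the score combines the norm of a feature's projection onto a residual subspace (the class-agnostic part) with the maximum seen-class logit (the class-specific part), and a fixed threshold routes each sample to a seen or unseen expert. The function names, the weight `alpha`, and the threshold are hypothetical and are not taken from the paper.

```python
import numpy as np

def fit_residual_basis(seen_feats, principal_dim):
    """Estimate a residual subspace from seen-class training features.

    Assumption: the residual subspace is spanned by the principal
    directions left over after keeping the top `principal_dim` ones.
    """
    mean = seen_feats.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(seen_feats - mean, full_matrices=True)
    return mean, vt[principal_dim:].T  # shape: (feat_dim, feat_dim - principal_dim)

def ood_score(feats, logits, mean, residual_basis, alpha=1.0):
    """Higher score means the sample looks more like an unseen class."""
    # Class-agnostic cue: energy of the feature outside the principal subspace.
    residual_norm = np.linalg.norm((feats - mean) @ residual_basis, axis=1)
    # Class-specific cue: confidence of the seen-class classifier.
    max_logit = logits.max(axis=1)
    return alpha * residual_norm - max_logit

def route_to_unseen(feats, logits, mean, residual_basis, threshold, alpha=1.0):
    """Return a boolean mask: True routes the sample to the unseen expert."""
    return ood_score(feats, logits, mean, residual_basis, alpha) > threshold

# Toy data (hypothetical): seen features live in the first two of four dimensions.
seen_train = np.array([[1.0, 0, 0, 0], [-1.0, 0, 0, 0],
                       [0, 1.0, 0, 0], [0, -1.0, 0, 0]])
seen_weights = np.array([[2.0, 0], [0, 2.0], [0, 0], [0, 0]])  # two seen classes

mean, basis = fit_residual_basis(seen_train, principal_dim=2)
test_feats = np.array([[1.0, 0, 0, 0],    # lies in the seen subspace
                       [0, 0, 3.0, 0]])   # lies in the residual subspace
test_logits = test_feats @ seen_weights
unseen_mask = route_to_unseen(test_feats, test_logits, mean, basis, threshold=0.0)
```

In a full pipeline, samples where `unseen_mask` is `False` would be labeled by the seen-class classifier, and the rest would be passed to whichever unseen-class model is plugged in. The threshold and `alpha` would need tuning on held-out data.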
