Multi-label Zero-Shot Audio Classification with Temporal Attention
Duygu Dogan, Huang Xie, Toni Heittola, Tuomas Virtanen
TL;DR
This work tackles multi-label zero-shot audio classification by introducing a temporal attention mechanism that weights audio segments according to their semantic compatibility with each target class. Acoustic and semantic embeddings are projected into a common space, and per-segment compatibility scores are weighted by attention before being combined into final class predictions. A time-wise softmax produces the segment-level importance for each class, and a hinge-based, ranking-aware loss (warp_loss) trains the model to rank true classes above negatives. Experiments on a subset of AudioSet show that temporal attention outperforms both a zero-shot model with uniformly aggregated features and a zero-rule baseline, with the supervised setting providing an upper bound. The results indicate improved handling of multi-label and class-imbalanced scenarios and suggest promise for zero-shot sound event detection.
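To make the ranking objective concrete, the following is a minimal sketch of a pairwise hinge loss for one multi-label clip. It is an illustration only, not the authors' warp_loss: it assumes an unweighted average over all (positive, negative) class pairs and omits any rank-dependent weighting the actual loss may apply. The function name `pairwise_hinge_loss` and the `margin` parameter are hypothetical.

```python
def pairwise_hinge_loss(scores, labels, margin=1.0):
    """Average hinge penalty over all (positive, negative) class pairs.

    scores: list of per-class scores for one clip
    labels: list of 0/1 ground-truth indicators, same length as scores
    margin: assumed margin by which a true class should outscore a negative
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        # No pair to rank, so this clip contributes no loss.
        return 0.0
    total = 0.0
    for s_pos in positives:
        for s_neg in negatives:
            # Penalize a negative that comes within `margin` of a positive.
            total += max(0.0, margin - s_pos + s_neg)
    return total / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Classes 0 and 2 are present; class 2 is scored below the negative class 1.
    print(pairwise_hinge_loss([2.0, 0.5, -1.0], [1, 0, 1]))
```

The loss is zero once every true class outscores every negative class by at least the margin, which is the ranking behavior the summary describes.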
Abstract
Zero-shot learning models are capable of classifying new classes by transferring knowledge from the seen classes using auxiliary information. While most existing zero-shot learning methods focus on single-label classification tasks, the present study introduces a method to perform multi-label zero-shot audio classification. To address the challenge of classifying multi-label sounds while generalizing to unseen classes, we adapt temporal attention. The temporal attention mechanism assigns importance weights to different audio segments based on their acoustic and semantic compatibility, thus enabling the model to capture the varying dominance of different sound classes within an audio sample by focusing on the segments most relevant for each class. This leads to more accurate multi-label zero-shot classification than methods employing temporally aggregated acoustic features without weighting, which treat all audio segments equally. We evaluate our approach on a subset of AudioSet against a zero-shot model using uniformly aggregated acoustic features, a zero-rule baseline, and the proposed method in the supervised scenario. Our results show that temporal attention enhances zero-shot audio classification performance in the multi-label scenario.
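The contrast between attention-weighted and uniform aggregation can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' architecture: segment and class embeddings are assumed to be already projected into the common space, compatibility is taken to be a dot product, and the attention logits are assumed to be the compatibility scores themselves. All function names and the `temperature` parameter are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention_scores(segments, class_embeddings, temperature=1.0):
    """One clip-level score per class, using a time-wise softmax over segments."""
    scores = []
    for c in class_embeddings:
        # Compatibility of every segment with this class.
        compat = [dot(s, c) for s in segments]
        # Assumption: attention logits are the compatibilities themselves.
        weights = softmax([v / temperature for v in compat])
        scores.append(sum(w * v for w, v in zip(weights, compat)))
    return scores


def uniform_scores(segments, class_embeddings):
    """Baseline: every segment contributes equally to the class score."""
    n = len(segments)
    return [sum(dot(s, c) for s in segments) / n for c in class_embeddings]


if __name__ == "__main__":
    # Toy clip: class 0 is active only in segment 0, class 1 only in segment 1.
    segments = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
    classes = [[4.0, 0.0], [0.0, 4.0]]
    print("attention:", attention_scores(segments, classes))
    print("uniform:  ", uniform_scores(segments, classes))
```

In the toy clip each sound occupies a single segment, so uniform averaging dilutes its score across the silent segments, while the softmax weighting concentrates on the segment where the class is most compatible.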
