
Multi-label Zero-Shot Audio Classification with Temporal Attention

Duygu Dogan, Huang Xie, Toni Heittola, Tuomas Virtanen

TL;DR

This work tackles multi-label zero-shot audio classification by introducing a temporal attention mechanism that weights audio segments according to their semantic compatibility with the target classes. Acoustic and semantic embeddings are projected into a common space, and per-segment compatibility scores are weighted by attention before being aggregated into final class predictions. A hinge-based, ranking-aware loss (warp_loss) trains the model to rank true classes above negatives, while a time-wise softmax produces segment-level importance for each class. Experiments on a subset of AudioSet show that temporal attention outperforms both a zero-shot model using uniformly aggregated features and a zero-rule baseline, with the supervised setting providing an upper bound. The results indicate improved handling of multi-label and imbalanced-class scenarios and suggest promise for zero-shot sound event detection.
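
The ranking objective mentioned above can be pictured with a plain pairwise hinge loss. The sketch below is a simplification under stated assumptions: it averages the hinge penalty over all positive/negative class pairs and leaves out any rank-dependent weighting the paper's loss may apply. The function name, the margin of 1.0, and the example scores are hypothetical.

```python
def pairwise_hinge_ranking_loss(scores, positives, margin=1.0):
    """Average hinge penalty over (positive, negative) class pairs.

    scores:    list of clip-level scores, one per class
    positives: set of class indices that are true labels for the clip
    margin:    assumed margin by which positives should outrank negatives
    """
    pos = [s for i, s in enumerate(scores) if i in positives]
    neg = [s for i, s in enumerate(scores) if i not in positives]
    if not pos or not neg:
        return 0.0  # no pairs to rank
    # A pair is penalized when the negative comes within `margin` of the positive.
    total = sum(max(0.0, margin - p + n) for p in pos for n in neg)
    return total / (len(pos) * len(neg))


# Hypothetical clip with four classes; classes 0 and 2 are the true labels.
well_ranked = pairwise_hinge_ranking_loss([3.0, 0.5, 2.5, -1.0], {0, 2})
# Hypothetical clip where the only true class is scored below a negative.
badly_ranked = pairwise_hinge_ranking_loss([0.0, 1.0], {0})
```

The loss is zero once every true class outscores every negative by at least the margin, and it grows as negatives climb past positives, which is the behavior a ranking-aware objective needs in the multi-label case.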

Abstract

Zero-shot learning models are capable of classifying new classes by transferring knowledge from the seen classes using auxiliary information. While most existing zero-shot learning methods have focused on single-label classification tasks, the present study introduces a method to perform multi-label zero-shot audio classification. To address the challenge of classifying multi-label sounds while generalizing to unseen classes, we adapt temporal attention. The temporal attention mechanism assigns importance weights to different audio segments based on their acoustic and semantic compatibility, thus enabling the model to capture the varying dominance of different sound classes within an audio sample by focusing on the segments most relevant for each class. This leads to more accurate multi-label zero-shot classification than methods employing temporally aggregated acoustic features without weighting, which treat all audio segments equally. We evaluate our approach on a subset of AudioSet against a zero-shot model using uniformly aggregated acoustic features, a zero-rule baseline, and the proposed method in the supervised scenario. Our results show that temporal attention enhances zero-shot audio classification performance in the multi-label scenario.
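
As a rough illustration of the weighting described above, the following minimal Python sketch scores one clip against two classes, once with temporal attention and once with uniform averaging. It is not the authors' implementation: the dot-product compatibility, the reuse of the compatibility scores as attention logits, the identity projection matrices, and all names (`attention_scores`, `uniform_scores`, `W_audio`, `W_text`) are assumptions chosen for illustration.

```python
import math


def project(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def compatibility(segments, class_embeddings, W_audio, W_text):
    """Per-class list of per-segment scores in the common space (assumed dot product)."""
    audio = [project(W_audio, s) for s in segments]
    text = [project(W_text, c) for c in class_embeddings]
    return [[dot(a, c) for a in audio] for c in text]


def attention_scores(segments, class_embeddings, W_audio, W_text):
    """Clip-level scores pooled with a time-wise softmax per class.

    Assumption: the attention logits are the compatibility scores themselves.
    """
    clip_scores, attention = [], []
    for compat in compatibility(segments, class_embeddings, W_audio, W_text):
        weights = softmax(compat)  # softmax over time, separately for each class
        attention.append(weights)
        clip_scores.append(sum(w * s for w, s in zip(weights, compat)))
    return clip_scores, attention


def uniform_scores(segments, class_embeddings, W_audio, W_text):
    """Baseline: every segment contributes equally to the clip-level score."""
    return [sum(compat) / len(compat)
            for compat in compatibility(segments, class_embeddings, W_audio, W_text)]


# Toy data: four segments, two classes, identity projections (all hypothetical).
identity = [[1.0, 0.0], [0.0, 1.0]]
segments = [[2.0, 0.0], [0.0, 2.0], [0.0, 2.0], [0.0, 2.0]]
classes = [[1.0, 0.0], [0.0, 1.0]]  # class 0 occurs only in the first segment

clip_scores, attention = attention_scores(segments, classes, identity, identity)
uniform = uniform_scores(segments, classes, identity, identity)
```

In this toy clip the first class is present only in the first segment, so the time-wise softmax concentrates its weight there and the attention-pooled score for that class exceeds the uniform average, which is diluted by the three segments where the class is absent.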

Paper Structure

This paper contains 13 sections, 10 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Illustration of the proposed multi-label zero-shot model. Training is shown with solid lines and testing with dashed lines. Acoustic and semantic embeddings are extracted from audio samples and class labels using pre-trained embedding models. The embeddings are projected into a common space, combined using a compatibility measure, and weighted by a temporal attention module. The weighted scores are aggregated to produce final prediction scores. During testing, the model processes unknown audio samples and new class labels and predicts the classes with the highest scores.
  • Figure 2: Strong labels (top) and attention weights (bottom) for class label dog of the audio sample "-4gqARaEJE_0_10.wav".