ASiT: Local-Global Audio Spectrogram vIsion Transformer for Event Classification

Sara Atito, Muhammad Awais, Wenwu Wang, Mark D Plumbley, Josef Kittler

TL;DR

This work tackles the data-hungry nature of vision transformers (ViTs) in audio by proposing ASiT, a self-supervised framework that operates on log-mel spectrograms. ASiT combines Group Masked Model Learning with a teacher-student distillation scheme to learn both local token-level and global spectrogram representations, using reconstruction and contrastive objectives in a multi-task setup. Evaluated on AudioSet, ESC-50, Speech Commands, and VoxCeleb, ASiT pretrained on audio alone achieves state-of-the-art results across five audio/speech classification tasks, with a notable mAP of 48.0 on AS-2M and 38.6 on AS-20K, and strong performance on speech and speaker tasks without relying on image datasets for pretraining. The results demonstrate the effectiveness of in-domain self-supervised pretraining and the value of integrating local-context learning with global-instance discrimination for robust audio representations.

Abstract

Transformers, which were originally developed for natural language processing, have recently generated significant interest in the computer vision and audio communities due to their flexibility in learning long-range relationships. Constrained by the data-hungry nature of transformers and the limited amount of labeled data, most transformer-based models for audio tasks are finetuned from ImageNet-pretrained models, despite the huge gap between the domains of natural images and audio. This has motivated research on self-supervised pretraining of audio transformers, which reduces the dependency on large amounts of labeled data and focuses on extracting concise representations of audio spectrograms. In this paper, we propose the Local-Global Audio Spectrogram vIsion Transformer, namely ASiT, a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation. We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification. We further conduct comprehensive ablation studies, including evaluations of different pretraining strategies. The proposed ASiT framework significantly boosts performance on all tasks and sets a new state of the art in five audio and speech classification tasks, outperforming recent methods, including approaches that use additional datasets for pretraining.
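
To make the multi-task idea concrete, the following is a minimal sketch of how a reconstruction term and a contrastive term could be combined alongside a teacher-student pair. It is not the authors' implementation: the function names, the L1 form of the reconstruction loss, the InfoNCE form of the contrastive loss, the temperature, and the momentum-based teacher update are all assumptions chosen for illustration, and the embeddings are plain Python lists rather than real network outputs.

```python
import math


def masked_l1(pred, target, mask):
    """Mean absolute error over corrupted bins only (mask == 1)."""
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += abs(p - t)
                count += 1
    return total / max(count, 1)


def info_nce(student, teacher, temperature=0.2):
    """Contrastive loss: student[i] should match teacher[i], not teacher[j != i].

    The InfoNCE form and the temperature are assumptions for this sketch.
    """
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    s = [unit(v) for v in student]
    t = [unit(v) for v in teacher]
    loss = 0.0
    for i, s_i in enumerate(s):
        # Cosine similarity of one student embedding to every teacher embedding.
        logits = [sum(a * b for a, b in zip(s_i, t_j)) / temperature for t_j in t]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        loss += log_norm - logits[i]
    return loss / len(s)


def ema_update(teacher_w, student_w, momentum=0.996):
    """Move teacher weights toward the student (assumed momentum update)."""
    return [momentum * tw + (1.0 - momentum) * sw
            for tw, sw in zip(teacher_w, student_w)]


def total_loss(pred, target, mask, student_emb, teacher_emb, weight=1.0):
    """Hypothetical multi-task objective: local reconstruction plus global contrast."""
    return masked_l1(pred, target, mask) + weight * info_nce(student_emb, teacher_emb)
```

In this sketch the reconstruction term only looks at corrupted bins, which pushes the student to infer local content from its surroundings, while the contrastive term compares whole-clip embeddings and so acts at the global level. How the two terms are actually weighted is not specified in the text above.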

Paper Structure (13 sections, 8 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: The proposed self-supervised framework (ASiT). For a given 10-second audio spectrogram, two randomly augmented views of 6 seconds each (clean spectrograms) are generated and fed to the GMML-based manipulation block to obtain the masked spectrograms. The clean and masked spectrograms are fed to the teacher and student networks, respectively. The recovery of the transformed information from the non-transformed class-token and data-tokens indicates that the network has learnt the semantics of both the local and the global representation of the given audio, and has learnt a useful inductive bias by learning local statistical correlations in the spectrogram.
  • Figure 2: Effect of longer pretraining.
  • Figure 3: Impact of masking percentage in pretraining.
  • Figure : (a) Input
  • Figure : (a) Input
  • ...and 2 more figures
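
The data flow described in the Figure 1 caption can be sketched roughly as follows: crop two random 6-second views from a 10-second spectrogram, keep them clean for the teacher, and corrupt groups of bins for the student. This is an illustrative sketch only. The frame rate, the number of mel bins, the block size, the mask ratio, the use of Gaussian noise as the corruption, and the function names `random_view` and `group_mask` are all assumptions that do not come from the text above, and the reading of "group masking" as corrupting contiguous time-frequency blocks is likewise an assumption.

```python
import random

FRAMES_PER_SECOND = 100  # assumed frame rate (10 ms hop); not stated above
N_MELS = 128             # assumed number of mel bins; not stated above


def random_view(spec, view_seconds=6, rng=random):
    """Crop a random fixed-length window along the time axis (spec[t][f])."""
    view_frames = view_seconds * FRAMES_PER_SECOND
    start = rng.randint(0, len(spec) - view_frames)  # both ends inclusive
    return [row[:] for row in spec[start:start + view_frames]]


def group_mask(view, mask_ratio=0.5, block=(16, 16), rng=random):
    """Corrupt contiguous time-frequency blocks with Gaussian noise.

    Returns the corrupted copy and a 0/1 mask marking the corrupted bins.
    Blocks are drawn until at least `mask_ratio` of the bins are corrupted.
    """
    n_t, n_f = len(view), len(view[0])
    masked = [row[:] for row in view]
    mask = [[0] * n_f for _ in range(n_t)]
    target = int(mask_ratio * n_t * n_f)
    covered = 0
    while covered < target:
        t0 = rng.randrange(0, n_t - block[0] + 1)
        f0 = rng.randrange(0, n_f - block[1] + 1)
        for t in range(t0, t0 + block[0]):
            for f in range(f0, f0 + block[1]):
                if not mask[t][f]:  # count each bin once, even if blocks overlap
                    mask[t][f] = 1
                    masked[t][f] = rng.gauss(0.0, 1.0)
                    covered += 1
    return masked, mask


# Usage: a synthetic 10-second "spectrogram" stands in for real log-mel features.
rng = random.Random(0)
spec = [[rng.gauss(0.0, 1.0) for _ in range(N_MELS)]
        for _ in range(10 * FRAMES_PER_SECOND)]
teacher_inputs = [random_view(spec, rng=rng) for _ in range(2)]       # clean views
student_inputs = [group_mask(v, rng=rng)[0] for v in teacher_inputs]  # masked views
```

The teacher sees the clean views and the student sees the corrupted ones, matching the caption. A real pipeline would operate on tensors and patch tokens rather than nested lists.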